TextRank for Text Summarization

데이터과학 삼학년

Natural Language Processing

Dan-k 2022. 5. 4. 18:59

TextRank for Text Summarization

- TextRank is an extractive approach: an unsupervised, graph-based text summarization technique.

Summarization methods

1. Extractive

- Builds the summary by extracting the n most influential sentences from the document

- Works without a separately labeled summary dataset (ground truth)

- e.g. TextRank, LexRank

- ex) summary = sentence 1, sentence 2, sentence 3

2. Abstractive

- Generates the summary with a seq2seq model trained on data in which each full text is paired with a pre-labeled summary

- e.g. attention mechanisms, GNN-based models, sequential models

- ex) gt label = 'School is fun', summary = 'Going to school is enjoyable'

TextRank

- TextRank is PageRank applied to text

- In TextRank, the graph nodes are sentences, and each edge weight is the similarity between the two sentences it connects
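
For reference, the weighted score update from the original TextRank paper can be written as below. Here d is the damping factor (0.85 in the paper), w_{ji} is the similarity between sentences j and i, and for an undirected sentence graph In(V) and Out(V) are both just the neighbours of V. Note that many PageRank implementations use (1 - d)/N instead of (1 - d) so that the scores sum to 1.

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)}
          \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)
```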

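Sentence embedding

- Can be done at the word level: embed each word, then use the sum of the word vectors as the sentence embedding

- Word embeddings can come from a pre-trained model (GloVe, ELMo, BERT, etc.)

- This experiment uses the sentence embedding model use-multilingual (Universal Sentence Encoder Multilingual) to embed whole sentences directly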
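
The word-level option can be sketched roughly as follows. This is only an illustration: the 3-dimensional vectors in `toy_vectors` and the helper `sentence_vector` are made up for the example, and real vectors would come from a pre-trained model.

```python
# Hypothetical 3-dimensional word vectors, invented for this example.
# Real vectors would come from a pre-trained model such as GloVe.
toy_vectors = {
    "school": [0.9, 0.1, 0.0],
    "is": [0.0, 0.2, 0.1],
    "fun": [0.1, 0.8, 0.3],
}


def sentence_vector(tokens, vectors, dim=3):
    """Sum the word vectors of the tokens; unknown tokens contribute zeros."""
    total = [0.0] * dim
    for token in tokens:
        for k, value in enumerate(vectors.get(token, [0.0] * dim)):
            total[k] += value
    return total


vec = sentence_vector(["school", "is", "fun"], toy_vectors)
print(vec)
```

Summing (or averaging) ignores word order, which is one reason a dedicated sentence encoder is used in this post instead.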

Computing sentence similarity

- Build a matrix of pairwise similarities between sentences

- With 10 sentences, this gives a 10x10 matrix

- Any similarity measure works, e.g. cosine or angular
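
A minimal pure-Python sketch of the pairwise matrix, with both measures. The function names and the 2-dimensional toy vectors are assumptions for illustration, and angular similarity is taken here as 1 - angle/pi, which is one common definition. Vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angular(u, v):
    """Angular similarity: 1 for the same direction, 0 for opposite."""
    c = max(-1.0, min(1.0, cosine(u, v)))  # clamp against rounding error
    return 1.0 - math.acos(c) / math.pi


def pairwise_similarity(vectors, sim=cosine):
    """n vectors in, n x n similarity matrix (list of lists) out."""
    return [[sim(u, v) for v in vectors] for u in vectors]


# Hypothetical 2-dimensional "sentence embeddings".
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cos_mat = pairwise_similarity(toy)
ang_mat = pairwise_similarity(toy, sim=angular)
```

Three vectors give a 3x3 matrix, in the same way that 10 sentences give a 10x10 one.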

Building the network and computing PageRank scores

- Build a graph from the sentence similarity matrix

- Compute the PageRank score of each node (sentence) in the sentence similarity graph
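
To show what the PageRank step does to a similarity matrix, here is a rough power-iteration sketch. It is not the NetworkX implementation: the function name, the toy matrix, and the handling of isolated nodes are assumptions made for the example. It uses the normalized form with (1 - damping)/n, so scores sum to 1 when every node has at least one edge.

```python
def pagerank(weights, damping=0.85, tol=1e-8, max_iter=200):
    """Weighted PageRank by power iteration on a square similarity matrix."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]  # total edge weight per node
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to w[j][i].
            # Nodes with no edges are skipped, so their score is not
            # redistributed (a simplification made for this sketch).
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Hypothetical 3-sentence matrix: sentence 0 is similar to both others,
# which are not similar to each other.
toy_weights = [
    [0.0, 0.8, 0.6],
    [0.8, 0.0, 0.0],
    [0.6, 0.0, 0.0],
]
scores = pagerank(toy_weights)
```

Sentence 0 ends up with the highest score because it is connected to both other sentences, which is the sense in which TextRank picks out "central" sentences.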

Extracting the top-scoring sentences

- Extract the sentences with the highest PageRank scores and join them
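
The selection step might look like the sketch below. The helper name and the toy data are hypothetical, and the `keep_order` flag is an optional variation not described in this post: it puts the chosen sentences back in document order, which sometimes reads more naturally.

```python
def top_sentences(sentences, scores, n=3, keep_order=False):
    """Join the n highest-scoring sentences into one summary string."""
    # Indices sorted from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:n]
    if keep_order:
        chosen.sort()  # restore the original document order
    return " ".join(sentences[i] for i in chosen)


# Hypothetical sentences and scores for illustration.
toy_sentences = ["A.", "B.", "C.", "D."]
toy_scores = [0.3, 0.1, 0.4, 0.2]

by_score = top_sentences(toy_sentences, toy_scores, n=2)
in_order = top_sentences(toy_sentences, toy_scores, n=2, keep_order=True)
```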

Code

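import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # imported for its side effect: registers ops the multilingual model needs
from sklearn.metrics.pairwise import cosine_similarity

# Load the multilingual Universal Sentence Encoder from TF Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

def word_embedding(text):
    # Despite the name, this returns a sentence embedding. The model takes a
    # batch of strings, so wrap the sentence in a list and take the first row.
    return embed([text]).numpy()[0]

# df is assumed to be a pandas DataFrame with one sentence per row in 'message'.
df['embedding'] = df['message'].apply(word_embedding)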


def similarity_matrix(sentence_embedding, threshold=0.2):
    # Stack the per-sentence vectors into one (n_sentences, dim) array.
    embeddings = np.vstack([np.ravel(e) for e in sentence_embedding])
    # Pairwise cosine similarity in one call instead of a double loop.
    sim_mat = cosine_similarity(embeddings)
    # Drop weak links: similarities below the threshold become 0 (no edge).
    sim_mat[sim_mat < threshold] = 0
    return sim_mat


def draw_graphs(sim_matrix):
    nx_graph = nx.from_numpy_array(sim_matrix)
    plt.figure(figsize=(10, 10))
    # Compute the layout once so nodes and edge labels share the same positions.
    pos = nx.spring_layout(nx_graph)
    nx.draw(nx_graph, pos, with_labels=True, font_weight='bold')
    nx.draw_networkx_edge_labels(nx_graph, pos, font_color='red')
    plt.show()


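def calculate_score(sim_matrix):
    nx_graph = nx.from_numpy_array(sim_matrix)
    # nx.pagerank_numpy was removed in NetworkX 3.0; nx.pagerank uses the
    # edge weights (the similarities) by default and returns {node: score}.
    scores = nx.pagerank(nx_graph)
    return scores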


def ranked_sentences(sentences, scores, n=3):
    # scores maps node index -> PageRank score; pair each score with its sentence.
    top_scores = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    top_n_sentences = [sentence for score, sentence in top_scores[:n]]
    return "\n".join(top_n_sentences)


def get_summarization(df):
    sim_mat = similarity_matrix(df['embedding'].values)
    df_score = calculate_score(sim_mat)
    # Embeddings come from 'message' but the summary text is taken from
    # 'translated_message'; the two columns are assumed to be aligned row by row.
    summarization = ranked_sentences(df.translated_message, df_score)

    return summarization


print(f'Summary: \n{get_summarization(df)}')

References

https://wikidocs.net/91051 - 2) TextRank Based on Sentence Embedding (문장 임베딩 기반 텍스트 랭크), wikidocs.net
