Natural Language Processing
bi-directional attention mechanism vs bi-directional model (Naver movie reviews)
Dan-k 2020. 6. 23. 14:21
The attention mechanism improves model performance by putting extra focus on particular vectors.
The attention covered here is set up as a model that places additional weight on the hidden states of an LSTM, and the analysis is based on that setup.
Comparing a bi-directional model against the same bi-directional model with an attention mechanism added, the attention mechanism showed better performance.
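In a few lines, the idea is: compute a relevance score for each time step, turn the scores into weights with a softmax, and take the weighted sum of the hidden states as a context vector. The sketch below only illustrates that computation with random numbers and made-up shapes; it is not part of the notebook that follows.
import numpy as np

# Illustration only: 7 hypothetical time steps with hidden size 16
hidden_states = np.random.randn(7, 16)            # LSTM outputs, shape (time, hidden)
scores = np.random.randn(7)                       # one relevance score per time step
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
context = (weights[:, None] * hidden_states).sum(axis=0)  # weighted sum, shape (hidden,)
print(weights.round(3), context.shape)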
import numpy as np
import pandas as pd
from soynlp.tokenizer import MaxScoreTokenizer
from soynlp.word import WordExtractor
from soynlp.tokenizer import LTokenizer
import os
import shutil
import urllib
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
from tensorflow.keras.layers import (
Dense,
Embedding,
Bidirectional,
Flatten,
LSTM,
GRU,
Conv1D,
Lambda,
SimpleRNN,
Concatenate,
Layer,
Dropout,
BatchNormalization
)
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
print(tf.__version__)
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
In [2]:
tf.keras.layers.LSTM
Out[2]:
In [3]:
LOGDIR = "./text_models"
1. Preprocess
In [4]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt", filename="ratings_train.txt")
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt", filename="ratings_test.txt")
Out[4]:
In [5]:
train_data = pd.read_table('ratings_train.txt')
test_data = pd.read_table('ratings_test.txt')
In [6]:
train_data.head()
Out[6]:
In [7]:
train_data['document'] = train_data['document'].apply(str)
In [8]:
corpus = train_data.document.apply(str)
word_extractor = WordExtractor()
word_extractor.train(corpus)
word_score = word_extractor.extract()
scores = {word:score.cohesion_forward for word, score in word_score.items()}
maxscore_tokenizer = MaxScoreTokenizer(scores=scores)
def soynlp_morphs(contents):
return ' '.join(maxscore_tokenizer.tokenize(contents))
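Not part of the original post: a quick sanity check of the learned tokenizer on the first review. The exact split depends on the cohesion scores learned above, so the output is only indicative.
sample = train_data['document'].iloc[0]
print(sample)                  # raw review text
print(soynlp_morphs(sample))   # the same review split by MaxScoreTokenizer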
In [9]:
%%time
train_data['soynlp_morphs_contents'] = train_data['document'].apply(soynlp_morphs)
In [10]:
X = train_data.soynlp_morphs_contents
y = train_data.label
In [11]:
## X,y
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X) ### build the vocab, give each word a unique index, then map each word to its index in sequence order
In [12]:
word_to_index = tokenizer.word_index
VOCAB_SIZE = len(word_to_index) + 1 ## add index 0, which is needed for padding
In [13]:
MAX_LEN = max(len(seq) for seq in sequences)
MAX_LEN
Out[13]:
In [14]:
def encode_labels(sources):
classes = [source for source in sources]
one_hots = to_categorical(classes)
return one_hots
In [15]:
def create_sequences(texts, max_len=MAX_LEN):
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, max_len, padding='post')
return padded_sequences
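A small shape check of the two helpers, added here for illustration and assuming the tokenizer and MAX_LEN defined above: reviews become fixed-length index sequences and labels become one-hot vectors.
demo_X = create_sequences(X.iloc[:3])
demo_y = encode_labels(y.iloc[:3])
print(demo_X.shape)  # (3, MAX_LEN)
print(demo_y.shape)  # (3, 2) as long as both classes appear in the slice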
In [16]:
X_train, X_valid, y_train, y_valid = train_test_split(create_sequences(X), encode_labels(y), test_size=0.1, random_state=42)
In [17]:
N_CLASSES = 2
In [18]:
y_train.shape
Out[18]:
In [19]:
VOCAB_SIZE
Out[19]:
2. Train
BahdanauAttention
In [20]:
class BahdanauAttention(tf.keras.Model):
def __init__(self, units):
super(BahdanauAttention, self).__init__()
self.W1 = Dense(units)
self.W2 = Dense(units)
self.V = Dense(1)
def call(self, values, query): # note: key and value are the same here
# hidden shape == (batch_size, hidden size)
# hidden_with_time_axis shape == (batch_size, 1, hidden size)
# we are doing this to perform addition to calculate the score
hidden_with_time_axis = tf.expand_dims(query, 1)
# score shape == (batch_size, max_length, 1)
# we get 1 at the last axis because we are applying score to self.V
# the shape of the tensor before applying self.V is (batch_size, max_length, units)
score = self.V(tf.nn.tanh(
self.W1(values) + self.W2(hidden_with_time_axis)))
# attention_weights shape == (batch_size, max_length, 1)
attention_weights = tf.nn.softmax(score, axis=1)
# context_vector shape after sum == (batch_size, hidden_size)
context_vector = attention_weights * values
context_vector = tf.reduce_sum(context_vector, axis=1)
return context_vector, attention_weights
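As a quick check (illustration only, with random tensors), the layer can be called directly: values plays the role of the bi-LSTM output sequence, query the concatenated hidden state, and the shapes follow the comments in call.
attn_check = BahdanauAttention(32)
values = tf.random.normal((4, 10, 256))   # (batch, time steps, 2 * 128 LSTM units)
query = tf.random.normal((4, 256))        # concatenated forward/backward hidden state
ctx, w = attn_check(values, query)
print(ctx.shape, w.shape)                 # (4, 256) and (4, 10, 1)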
In [21]:
def build_attention_model(vocab_size, max_len, embed_dim):
sequence_input = Input(shape=(max_len,), dtype='int32')
embedded_sequences = Embedding(vocab_size+1, embed_dim, input_length=max_len)(sequence_input)
lstm, forward_h, forward_c, backward_h, backward_c = Bidirectional(LSTM(128,
dropout=0.3,
return_sequences=True,
return_state=True,
recurrent_activation='relu',
recurrent_initializer='glorot_uniform'
)
)(embedded_sequences)
print(lstm.shape, forward_h.shape, forward_c.shape, backward_h.shape, backward_c.shape)
state_h = Concatenate()([forward_h, backward_h]) # hidden state
state_c = Concatenate()([forward_c, backward_c]) # cell state
attention = BahdanauAttention(32) # size (units) of the attention weights
context_vector, attention_weights = attention(lstm, state_h)
hidden = BatchNormalization()(context_vector)
dense1 = Dense(20, activation="relu")(hidden)
dropout = Dropout(0.05)(dense1)
output = Dense(N_CLASSES, activation='softmax')(dropout)
model = Model(inputs=sequence_input, outputs=output)
Adam = optimizers.Adam(lr=0.001, clipnorm=1.)
model.compile(optimizer=Adam, loss='categorical_crossentropy', metrics=['accuracy'])
return model
In [22]:
def build_bisimpernn_model(vocab_size, embed_dim, max_len, units, N_CLASSES):
model = Sequential([
Embedding(vocab_size + 1, embed_dim, input_shape=[max_len]),
Bidirectional(SimpleRNN(units,
dropout=0.3)
),
Dense(N_CLASSES, activation='softmax')
])
model.compile(
optimizer=tf.keras.optimizers.Adam(
learning_rate=0.0005, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,
name='Adam'
),
loss='categorical_crossentropy',
metrics=['accuracy']
)
return model
bi-simpleRNN
In [23]:
%%time
tf.random.set_seed(33)
MODEL_DIR = os.path.join(LOGDIR, 'bidirecttional_lstm')
shutil.rmtree(MODEL_DIR, ignore_errors=True)
EPOCHS = 15
BATCH_SIZE = 300
EMBED_DIM = 100
UNITS = 16
PATIENCE = 0
# vocab_size, embed_dim, input_length, units, N_CLASSES
bisimplernn_model = build_bisimpernn_model(vocab_size=VOCAB_SIZE,
embed_dim=EMBED_DIM,
max_len=MAX_LEN,
units=UNITS,
N_CLASSES=N_CLASSES)
history = bisimplernn_model.fit(
X_train,
y_train,
epochs=EPOCHS,
batch_size=BATCH_SIZE,
validation_data=(X_valid, y_valid),
# callbacks=[EarlyStopping(patience=PATIENCE), TensorBoard(MODEL_DIR)],
callbacks=[TensorBoard(MODEL_DIR)],
use_multiprocessing=True, ## set this to True when an out-of-memory (OOM) error occurs
verbose=1
)
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()
bisimplernn_model.summary()
The loss diverges after epoch 1, so training should have been stopped with early stopping
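One way to enforce this would be to enable the EarlyStopping callback that is already imported (and commented out in the fit call); the configuration below is a sketch, not something that was run in this post.
early_stop = EarlyStopping(monitor='val_loss', patience=1, restore_best_weights=True)
# then pass callbacks=[early_stop, TensorBoard(MODEL_DIR)] to model.fit(...)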
In [24]:
tf.keras.utils.plot_model(bisimplernn_model, show_shapes=True, dpi=90)
Out[24]:
bi-directional LSTM with Attention
In [25]:
%%time
tf.random.set_seed(33)
MODEL_DIR = os.path.join(LOGDIR, 'bidirecttional_lstm_attention')
shutil.rmtree(MODEL_DIR, ignore_errors=True)
EPOCHS = 15
BATCH_SIZE = 300
EMBED_DIM = 30
UNITS = 16
PATIENCE = 0
attention_model = build_attention_model(vocab_size=VOCAB_SIZE, max_len=MAX_LEN, embed_dim=EMBED_DIM)
history = attention_model.fit(
X_train,
y_train,
epochs=EPOCHS,
batch_size=BATCH_SIZE,
validation_data=(X_valid, y_valid),
# callbacks=[EarlyStopping(patience=PATIENCE), TensorBoard(MODEL_DIR)],
callbacks=[TensorBoard(MODEL_DIR)],
use_multiprocessing=True, ## set this to True when an out-of-memory (OOM) error occurs
verbose=1
)
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()
attention_model.summary()
In [26]:
tf.keras.utils.plot_model(attention_model, show_shapes=True, dpi=90)
Out[26]:
3. Test
In [27]:
test_data['document'] = test_data['document'].apply(str)
test_data['soynlp_morphs_contents'] = test_data['document'].apply(soynlp_morphs)
X_test=test_data.soynlp_morphs_contents
y_test=test_data.label
In [28]:
X_test, y_test = create_sequences(X_test), encode_labels(y_test)
In [29]:
def convert_argmax(array):
return np.argmax(array, axis=1)
Bi-directional simpleRNN
In [30]:
bisimplernn_model_pred = bisimplernn_model.predict(X_test)
target_names = ['불건전', '건전']
print(confusion_matrix(convert_argmax(y_test), convert_argmax(bisimplernn_model_pred)))
print(classification_report(convert_argmax(y_test),
convert_argmax(bisimplernn_model_pred),
target_names=target_names))
Bi-directional LSTM with attention
In [31]:
attention_model_pred = attention_model.predict(X_test)
target_names = ['불건전', '건전']
print(confusion_matrix(convert_argmax(y_test), convert_argmax(attention_model_pred)))
print(classification_report(convert_argmax(y_test),
convert_argmax(attention_model_pred),
target_names=target_names))
4. Results
- Recall on the 불건전 class (label 0):
  - bi-directional LSTM with Attention reaches 0.84, higher than the 0.81 of the bi-directional simple RNN
- Training time:
  - bi-directional simple RNN: 40 minutes
  - bi-directional LSTM with Attention: 1 hour 37 minutes
  - bi-directional LSTM with Attention > bi-directional simple RNN
- Overall:
  - results could differ with hyperparameter tuning or higher-quality data,
  - but the advantage of the attention model was confirmed