
bi-directional attention mechanism vs bi-directional model (Naver movie reviews)

Dan-k 2020. 6. 23. 14:21

The attention mechanism is a technique that improves model performance by focusing more heavily on particular vectors.

The attention used here weights the hidden states of an LSTM model.

Comparing a bidirectional model against the same bidirectional model with an attention mechanism added, the attention mechanism showed better performance.
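As a rough sketch of the idea (toy NumPy code with made-up dimensions, not part of the notebook below), Bahdanau-style additive attention scores every time step's hidden state, normalizes the scores with a softmax, and returns the weighted sum as a context vector:

import numpy as np

# toy shapes: batch=1, 4 time steps, hidden size 3, attention units 2 (all illustrative)
values = np.random.rand(1, 4, 3)   # hidden states of every time step
query  = np.random.rand(1, 3)      # final hidden state used as the query
W1, W2, v = np.random.rand(3, 2), np.random.rand(3, 2), np.random.rand(2, 1)

score   = np.tanh(values @ W1 + (query @ W2)[:, None, :]) @ v         # (1, 4, 1) additive score
weights = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)    # softmax over time steps
context = (weights * values).sum(axis=1)                              # (1, 3) weighted sum of hidden states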

import numpy as np
import pandas as pd

from soynlp.tokenizer import MaxScoreTokenizer
from soynlp.word import WordExtractor
from soynlp.tokenizer import LTokenizer

import os
import shutil
import urllib

import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
from tensorflow.keras.layers import (
    Dense,
    Embedding,
    Bidirectional,
    Flatten,
    LSTM,
    GRU,
    Conv1D,
    Lambda,
    SimpleRNN,
    Concatenate,
    Layer,
    Dropout,
    BatchNormalization
)
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical



print(tf.__version__)

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
 
2.0.0
In [2]:
tf.keras.layers.LSTM
Out[2]:
tensorflow.python.keras.layers.recurrent_v2.LSTM
In [3]:
LOGDIR = "./text_models"
 

1.Preprocess

In [4]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt", filename="ratings_train.txt")
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt", filename="ratings_test.txt")
Out[4]:
('ratings_test.txt', <http.client.HTTPMessage at 0x7fabbdcd6b38>)
In [5]:
train_data = pd.read_table('ratings_train.txt')
test_data = pd.read_table('ratings_test.txt')
In [6]:
train_data.head()
Out[6]:
  id document label
0 9976970 아 더빙.. 진짜 짜증나네요 목소리 0
1 3819312 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 1
2 10265843 너무재밓었다그래서보는것을추천한다 0
3 9045019 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 0
4 6483659 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ... 1
In [7]:
train_data['document'] = train_data['document'].apply(str)
In [8]:
corpus = train_data.document.apply(str)
word_extractor = WordExtractor()
word_extractor.train(corpus)
word_score = word_extractor.extract()
scores = {word:score.cohesion_forward for word, score in word_score.items()}
maxscore_tokenizer = MaxScoreTokenizer(scores=scores)

def soynlp_morphs(contents):
    return ' '.join(maxscore_tokenizer.tokenize(contents))
 
training was done. used memory 0.876 Gb
all cohesion probabilities was computed. # words = 85684
all branching entropies was computed # words = 101540
all accessor variety was computed # words = 101540
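For reference, the trained tokenizer can be applied to a single review like this (the sentence is made up; the actual split depends on the learned cohesion scores):

print(maxscore_tokenizer.tokenize('이 영화 정말 재밌어요'))  # list of tokens
print(soynlp_morphs('이 영화 정말 재밌어요'))                # the same tokens joined with spaces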
In [9]:
%%time
train_data['soynlp_morphs_contents'] = train_data['document'].apply(soynlp_morphs)
 
CPU times: user 23 s, sys: 0 ns, total: 23 s
Wall time: 23 s
In [10]:
X = train_data.soynlp_morphs_contents
y = train_data.label
In [11]:
## X,y
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)  ### fit_on_texts builds the vocab and assigns each word a unique index; texts_to_sequences then maps each token to its index in order
In [12]:
word_to_index = tokenizer.word_index
VOCAB_SIZE = len(word_to_index) + 1  ## add index 0, which is reserved for padding
In [13]:
MAX_LEN = max(len(seq) for seq in sequences)
MAX_LEN
Out[13]:
70
In [14]:
def encode_labels(sources):
    classes = [source for source in sources]
    one_hots = to_categorical(classes)
    return one_hots
In [15]:
def create_sequences(texts, max_len=MAX_LEN):
    sequences = tokenizer.texts_to_sequences(texts)
    padded_sequences = pad_sequences(sequences, max_len, padding='post')
    return padded_sequences
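A minimal check of the two helpers (the review text is made up; the exact indices depend on the fitted tokenizer):

print(create_sequences(['영화 가 재밌다']).shape)  # (1, 70): index sequence zero-padded at the end ('post')
print(encode_labels([0, 1]))                       # [[1., 0.], [0., 1.]]: one-hot labels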
In [16]:
X_train, X_valid, y_train, y_valid = train_test_split(create_sequences(X), encode_labels(y), test_size=0.1, random_state=42)
In [17]:
N_CLASSES = 2
In [18]:
y_train.shape
Out[18]:
(135000, 2)
In [19]:
VOCAB_SIZE
Out[19]:
130466
 

2.Train

 

BahdanauAttention

In [20]:
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = Dense(units)
        self.W2 = Dense(units)
        self.V = Dense(1)

    def call(self, values, query): # note: the key and the value are the same here
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        # the shape of the tensor before applying self.V is (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(hidden_with_time_axis)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
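A quick shape check of the layer with random tensors (the sizes mirror the model below but are otherwise arbitrary):

attn = BahdanauAttention(32)
values = tf.random.normal([8, 70, 256])   # (batch, max_len, 2 * LSTM units)
query  = tf.random.normal([8, 256])       # concatenated forward/backward hidden state
context_vector, attention_weights = attn(values, query)
print(context_vector.shape, attention_weights.shape)  # (8, 256) (8, 70, 1)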
In [21]:
def build_attention_model(vocab_size, max_len, embed_dim):    
    sequence_input = Input(shape=(max_len,), dtype='int32')
    embedded_sequences = Embedding(vocab_size+1, embed_dim, input_length=max_len)(sequence_input)
    lstm, forward_h, forward_c, backward_h, backward_c = Bidirectional(LSTM(128,
                                                                          dropout=0.3,
                                                                          return_sequences=True,
                                                                          return_state=True,
                                                                          recurrent_activation='relu',
                                                                          recurrent_initializer='glorot_uniform'
                                                                        )
                                                                      )(embedded_sequences)

    print(lstm.shape, forward_h.shape, forward_c.shape, backward_h.shape, backward_c.shape)
    
    state_h = Concatenate()([forward_h, backward_h]) # hidden state (forward + backward)
    state_c = Concatenate()([forward_c, backward_c]) # cell state (forward + backward)

    attention = BahdanauAttention(32) # size of the attention score layer
    context_vector, attention_weights = attention(lstm, state_h)

    hidden = BatchNormalization()(context_vector)
    dense1 = Dense(20, activation="relu")(hidden)
    dropout = Dropout(0.05)(dense1)
    output = Dense(N_CLASSES, activation='softmax')(dropout)
    model = Model(inputs=sequence_input, outputs=output)
    Adam = optimizers.Adam(lr=0.001, clipnorm=1.)
    model.compile(optimizer=Adam, loss='categorical_crossentropy', metrics=['accuracy'])
    return model
In [22]:
def build_bisimpernn_model(vocab_size, embed_dim, max_len, units, N_CLASSES):
    model = Sequential([
        Embedding(vocab_size + 1, embed_dim, input_shape=[max_len]), 
        Bidirectional(SimpleRNN(units,
                          dropout=0.3)
                      ),
        Dense(N_CLASSES, activation='softmax')
        ])
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=0.0005, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,
            name='Adam'
        ),
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
 

 

bi-simpleRNN

In [23]:
%%time

tf.random.set_seed(33)

MODEL_DIR = os.path.join(LOGDIR, 'bidirecttional_lstm')
shutil.rmtree(MODEL_DIR, ignore_errors=True)

EPOCHS = 15
BATCH_SIZE = 300
EMBED_DIM = 100
UNITS = 16
PATIENCE = 0

# vocab_size, embed_dim, input_length, units, N_CLASSES

bisimplernn_model = build_bisimpernn_model(vocab_size=VOCAB_SIZE,
                                  embed_dim=EMBED_DIM,
                                  max_len=MAX_LEN,
                                  units=UNITS,
                                  N_CLASSES=N_CLASSES)

history = bisimplernn_model.fit(
    X_train, 
    y_train, 
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_valid, y_valid),
#     callbacks=[EarlyStopping(patience=PATIENCE), TensorBoard(MODEL_DIR)],
    callbacks=[TensorBoard(MODEL_DIR)],
    use_multiprocessing=True, ## switch to True when an OOM (out of memory) error occurs
    verbose=1
)

pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()

bisimplernn_model.summary()
 
Train on 135000 samples, validate on 15000 samples
Epoch 1/15
135000/135000 [==============================] - 184s 1ms/sample - loss: 0.4706 - accuracy: 0.7675 - val_loss: 0.3639 - val_accuracy: 0.8396
Epoch 2/15
135000/135000 [==============================] - 181s 1ms/sample - loss: 0.2857 - accuracy: 0.8861 - val_loss: 0.3691 - val_accuracy: 0.8388
Epoch 3/15
135000/135000 [==============================] - 180s 1ms/sample - loss: 0.1958 - accuracy: 0.9266 - val_loss: 0.3977 - val_accuracy: 0.8367
Epoch 4/15
135000/135000 [==============================] - 157s 1ms/sample - loss: 0.1353 - accuracy: 0.9515 - val_loss: 0.4607 - val_accuracy: 0.8286
Epoch 5/15
135000/135000 [==============================] - 154s 1ms/sample - loss: 0.0969 - accuracy: 0.9665 - val_loss: 0.5014 - val_accuracy: 0.8277
Epoch 6/15
135000/135000 [==============================] - 158s 1ms/sample - loss: 0.0789 - accuracy: 0.9722 - val_loss: 0.5499 - val_accuracy: 0.8255
Epoch 7/15
135000/135000 [==============================] - 157s 1ms/sample - loss: 0.0630 - accuracy: 0.9781 - val_loss: 0.5973 - val_accuracy: 0.8209
Epoch 8/15
135000/135000 [==============================] - 162s 1ms/sample - loss: 0.0539 - accuracy: 0.9814 - val_loss: 0.6408 - val_accuracy: 0.8210
Epoch 9/15
135000/135000 [==============================] - 157s 1ms/sample - loss: 0.0472 - accuracy: 0.9837 - val_loss: 0.6890 - val_accuracy: 0.8165
Epoch 10/15
135000/135000 [==============================] - 159s 1ms/sample - loss: 0.0422 - accuracy: 0.9854 - val_loss: 0.7204 - val_accuracy: 0.8087
Epoch 11/15
135000/135000 [==============================] - 154s 1ms/sample - loss: 0.0389 - accuracy: 0.9865 - val_loss: 0.7431 - val_accuracy: 0.8148
Epoch 12/15
135000/135000 [==============================] - 155s 1ms/sample - loss: 0.0351 - accuracy: 0.9880 - val_loss: 0.7821 - val_accuracy: 0.8120
Epoch 13/15
135000/135000 [==============================] - 151s 1ms/sample - loss: 0.0326 - accuracy: 0.9887 - val_loss: 0.8159 - val_accuracy: 0.8127
Epoch 14/15
135000/135000 [==============================] - 156s 1ms/sample - loss: 0.0316 - accuracy: 0.9888 - val_loss: 0.8249 - val_accuracy: 0.8093
Epoch 15/15
135000/135000 [==============================] - 158s 1ms/sample - loss: 0.0308 - accuracy: 0.9888 - val_loss: 0.8440 - val_accuracy: 0.8084
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 70, 100)           13046700  
_________________________________________________________________
bidirectional (Bidirectional (None, 32)                3744      
_________________________________________________________________
dense (Dense)                (None, 2)                 66        
=================================================================
Total params: 13,050,510
Trainable params: 13,050,510
Non-trainable params: 0
_________________________________________________________________
CPU times: user 4h 15min 38s, sys: 6h 26min 44s, total: 10h 42min 22s
Wall time: 40min 22s
 
 
 
The validation loss diverges after epoch 1, which shows training should have been stopped with early stopping.
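One way to do that is to re-enable the commented-out EarlyStopping callback, e.g. monitoring the validation loss (the patience value below is illustrative):

early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
# then pass callbacks=[early_stop, TensorBoard(MODEL_DIR)] to model.fit()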
In [24]:
tf.keras.utils.plot_model(bisimplernn_model, show_shapes=True, dpi=90) 
Out[24]:
 

 

bi-LSTM with Attention

In [25]:
%%time

tf.random.set_seed(33)

MODEL_DIR = os.path.join(LOGDIR, 'bidirecttional_lstm_attention')
shutil.rmtree(MODEL_DIR, ignore_errors=True)

EPOCHS = 15
BATCH_SIZE = 300
EMBED_DIM = 30
UNITS = 16
PATIENCE = 0

attention_model = build_attention_model(vocab_size=VOCAB_SIZE, max_len=MAX_LEN, embed_dim=EMBED_DIM)

history = attention_model.fit(
    X_train, 
    y_train, 
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_valid, y_valid),
#     callbacks=[EarlyStopping(patience=PATIENCE), TensorBoard(MODEL_DIR)],
    callbacks=[TensorBoard(MODEL_DIR)],
    use_multiprocessing=True, ## switch to True when an OOM (out of memory) error occurs
    verbose=1
)

pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()

attention_model.summary()
 
(None, 70, 256) (None, 128) (None, 128) (None, 128) (None, 128)
Train on 135000 samples, validate on 15000 samples
WARNING:tensorflow:Model failed to serialize as JSON. Ignoring... 
Epoch 1/15
135000/135000 [==============================] - 408s 3ms/sample - loss: 0.4181 - accuracy: 0.7948 - val_loss: 0.4209 - val_accuracy: 0.8551
Epoch 2/15
135000/135000 [==============================] - 392s 3ms/sample - loss: 0.2529 - accuracy: 0.8986 - val_loss: 0.3336 - val_accuracy: 0.8542
Epoch 3/15
135000/135000 [==============================] - 393s 3ms/sample - loss: 0.1752 - accuracy: 0.9343 - val_loss: 0.3890 - val_accuracy: 0.8495
Epoch 4/15
135000/135000 [==============================] - 390s 3ms/sample - loss: 0.1332 - accuracy: 0.9512 - val_loss: 0.4437 - val_accuracy: 0.8449
Epoch 5/15
135000/135000 [==============================] - 386s 3ms/sample - loss: 0.1081 - accuracy: 0.9601 - val_loss: 0.5304 - val_accuracy: 0.8422
Epoch 6/15
135000/135000 [==============================] - 389s 3ms/sample - loss: 0.0932 - accuracy: 0.9652 - val_loss: 0.6455 - val_accuracy: 0.8415
Epoch 7/15
135000/135000 [==============================] - 392s 3ms/sample - loss: 0.0828 - accuracy: 0.9688 - val_loss: 0.5788 - val_accuracy: 0.8421
Epoch 8/15
135000/135000 [==============================] - 387s 3ms/sample - loss: 0.0740 - accuracy: 0.9713 - val_loss: 0.6805 - val_accuracy: 0.8392
Epoch 9/15
135000/135000 [==============================] - 386s 3ms/sample - loss: 0.0680 - accuracy: 0.9738 - val_loss: 0.7251 - val_accuracy: 0.8402
Epoch 10/15
135000/135000 [==============================] - 390s 3ms/sample - loss: 0.0638 - accuracy: 0.9752 - val_loss: 0.7648 - val_accuracy: 0.8394
Epoch 11/15
135000/135000 [==============================] - 389s 3ms/sample - loss: 0.0595 - accuracy: 0.9767 - val_loss: 0.7528 - val_accuracy: 0.8389
Epoch 12/15
135000/135000 [==============================] - 391s 3ms/sample - loss: 0.0557 - accuracy: 0.9780 - val_loss: 0.8318 - val_accuracy: 0.8326
Epoch 13/15
135000/135000 [==============================] - 386s 3ms/sample - loss: 0.0527 - accuracy: 0.9793 - val_loss: 0.8682 - val_accuracy: 0.8379
Epoch 14/15
135000/135000 [==============================] - 390s 3ms/sample - loss: 0.0510 - accuracy: 0.9801 - val_loss: 0.8196 - val_accuracy: 0.8355
Epoch 15/15
135000/135000 [==============================] - 392s 3ms/sample - loss: 0.0480 - accuracy: 0.9811 - val_loss: 0.8171 - val_accuracy: 0.8367
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 70)]         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 70, 30)       3914010     input_1[0][0]                    
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, 70, 256), (N 162816      embedding_1[0][0]                
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 256)          0           bidirectional_1[0][1]            
                                                                 bidirectional_1[0][3]            
__________________________________________________________________________________________________
bahdanau_attention (BahdanauAtt ((None, 256), (None, 16481       bidirectional_1[0][0]            
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 256)          1024        bahdanau_attention[0][0]         
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 20)           5140        batch_normalization[0][0]        
__________________________________________________________________________________________________
dropout (Dropout)               (None, 20)           0           dense_4[0][0]                    
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 2)            42          dropout[0][0]                    
==================================================================================================
Total params: 4,099,513
Trainable params: 4,099,001
Non-trainable params: 512
__________________________________________________________________________________________________
CPU times: user 10h 51min 50s, sys: 15h 6min 52s, total: 1d 1h 58min 43s
Wall time: 1h 37min 43s
 
 
In [26]:
tf.keras.utils.plot_model(attention_model, show_shapes=True, dpi=90) 
Out[26]:
 

3.Test

In [27]:
test_data['document'] = test_data['document'].apply(str)
test_data['soynlp_morphs_contents'] = test_data['document'].apply(soynlp_morphs)
X_test=test_data.soynlp_morphs_contents
y_test=test_data.label
In [28]:
X_test, y_test = create_sequences(X_test), encode_labels(y_test)
In [29]:
def convert_argmax(array):
    return np.argmax(array, axis=1)
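With these helpers in place, a single new review can be scored end to end like this (the sentence is made up; the output is a softmax over [불건전, 건전]):

sample = create_sequences([soynlp_morphs('배우들 연기가 정말 좋았다')])
print(attention_model.predict(sample))                   # e.g. [[p_negative, p_positive]]
print(convert_argmax(attention_model.predict(sample)))   # predicted class index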
 

Bi-directional simpleRNN

In [30]:
bisimplernn_model_pred = bisimplernn_model.predict(X_test)
target_names = ['불건전', '건전']
print(confusion_matrix(convert_argmax(y_test), convert_argmax(bisimplernn_model_pred)))
print(classification_report(convert_argmax(y_test), 
                            convert_argmax(bisimplernn_model_pred), 
                            target_names=target_names))
 
[[19900  4927]
 [ 4669 20504]]
              precision    recall  f1-score   support

         불건전       0.81      0.80      0.81     24827
          건전       0.81      0.81      0.81     25173

    accuracy                           0.81     50000
   macro avg       0.81      0.81      0.81     50000
weighted avg       0.81      0.81      0.81     50000

 

Bi-directional LSTM with attention

In [31]:
attention_model_pred = attention_model.predict(X_test)
target_names = ['불건전', '건전']
print(confusion_matrix(convert_argmax(y_test), convert_argmax(attention_model_pred)))
print(classification_report(convert_argmax(y_test), 
                            convert_argmax(attention_model_pred), 
                            target_names=target_names))
 
[[20807  4020]
 [ 4059 21114]]
              precision    recall  f1-score   support

         불건전       0.84      0.84      0.84     24827
          건전       0.84      0.84      0.84     25173

    accuracy                           0.84     50000
   macro avg       0.84      0.84      0.84     50000
weighted avg       0.84      0.84      0.84     50000

 

4.Results

 
  • Recall for classifying unwholesome (불건전) content
    • bi-directional simple RNN: 0.80
    • bi-directional LSTM with Attention: 0.84, higher than the bi-directional simple RNN (a quick check follows this list)
  • Training time
    • bi-directional simple RNN: about 40 minutes
    • bi-directional LSTM with Attention: about 1 hour 37 minutes
    • bi-directional LSTM with Attention > bi-directional simple RNN
  • Overall
    • The results could differ with hyperparameter tuning or higher-quality data,
    • but the advantage of the attention model was confirmed.
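As a quick check, the recall for the 불건전 class can be recomputed directly from the confusion matrices above:

print(19900 / (19900 + 4927))   # bi-directional simple RNN,   불건전 recall ≈ 0.80
print(20807 / (20807 + 4020))   # bi-LSTM with attention,      불건전 recall ≈ 0.84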

 
