
Natural Language Processing

Text Classification with tf.keras (2.0) & soynlp (DNN, RNN, CNN)

Dan-k 2020. 6. 12. 11:42

soynlp is used for morpheme-level tokenization,

and tf.keras is used to build a binary classifier for the classification problem.

Simple DNN, RNN, and CNN models are applied.
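
As a rough sketch of how soynlp tokenizes (with toy cohesion scores made up for illustration; the real scores are computed from the corpus below), MaxScoreTokenizer splits an unspaced string at the substrings with the highest scores:

from soynlp.tokenizer import MaxScoreTokenizer

toy_scores = {'데이터': 0.8, '과학': 0.6}   # hypothetical cohesion scores, not learned from data
toy_tokenizer = MaxScoreTokenizer(scores=toy_scores)
print(toy_tokenizer.tokenize('데이터과학삼학년'))
# roughly: ['데이터', '과학', '삼학년'] -- unscored remainders are kept as single tokens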

 

import numpy as np
import pandas as pd

from soynlp.tokenizer import MaxScoreTokenizer
from soynlp.word import WordExtractor
from soynlp.tokenizer import LTokenizer

import os
import shutil

import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
from tensorflow.keras.layers import (
    Embedding,
    Flatten,
    GRU,
    Conv1D,
    Lambda,
    Dense,
)
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical


print(tf.__version__)

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
2.0.0
 
LOGDIR = "./text_models"
 

 

1. Preprocess

data = pd.read_csv('label_data.csv', encoding='utf-8', index_col=0)
 
data.head()
 
  contents label
0 hello,hi,we find to people here,hello,my dog j... 1
1 asdasdad 1
2 dasfsdfasdfsdfas 1
3 ㄹㄹㄹㄹㄹㄹㄹ 1
4 dddd 1
data['contents'] = data['contents'].apply(str)
 
corpus = data.contents.apply(str)
word_extractor = WordExtractor()
word_extractor.train(corpus)
word_score = word_extractor.extract()
scores = {word:score.cohesion_forward for word, score in word_score.items()}
maxscore_tokenizer = MaxScoreTokenizer(scores=scores)

def soynlp_morphs(contents):
    return ' '.join(maxscore_tokenizer.tokenize(contents))
training was done. used memory 0.510 Gb
all cohesion probabilities was computed. # words = 34332
all branching entropies was computed # words = 58734
all accessor variety was computed # words = 58734

 

 
%%time
data['soynlp_morphs_contents'] = data['contents'].apply(soynlp_morphs)
CPU times: user 7.91 s, sys: 4 ms, total: 7.92 s
Wall time: 7.91 s
 
X = data.soynlp_morphs_contents
y = data.label
 
## X,y
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)  ### build a vocab, assign each word a unique index, then map each text to its sequence of indices
 
word_to_index = tokenizer.word_index
VOCAB_SIZE = len(word_to_index) + 1  ## add 1 for index 0, which is reserved for padding
 
MAX_LEN = max(len(seq) for seq in sequences)
MAX_LEN
 
12308
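
To make the index mapping and padding concrete, here is a minimal sketch with made-up sentences (not the actual data):

toy_texts = ['hello hi dog', 'hello people']
toy_tok = Tokenizer()
toy_tok.fit_on_texts(toy_texts)
print(toy_tok.word_index)                        # {'hello': 1, 'hi': 2, 'dog': 3, 'people': 4}
print(toy_tok.texts_to_sequences(toy_texts))     # [[1, 2, 3], [1, 4]]
print(pad_sequences(toy_tok.texts_to_sequences(toy_texts), 4, padding='post'))
# [[1 2 3 0]
#  [1 4 0 0]]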
 
def encode_labels(sources):
    classes = [source for source in sources]
    one_hots = to_categorical(classes)
    return one_hots
 
def create_sequences(texts, max_len=MAX_LEN):
    sequences = tokenizer.texts_to_sequences(texts)
    padded_sequences = pad_sequences(sequences, max_len, padding='post')
    return padded_sequences
 
X_train, X_test, y_train, y_test = train_test_split(create_sequences(X), encode_labels(y), test_size=0.3, random_state=42)
 
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
N_CLASSES = 2
 

2. Train

DNN

def build_dnn_model(embed_dim):

    model = Sequential([
        Embedding(VOCAB_SIZE + 1, embed_dim, input_shape=[MAX_LEN]),
        Lambda(lambda x: tf.reduce_mean(x, axis=1)),  # average the word embeddings over the sequence (mean pooling)
        Dense(100, activation='relu'),
        Dense(100, activation='relu'),
        Dense(N_CLASSES, activation='softmax')  ## activation=tf.nn.softmax
    ])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=0.0005, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,
            name='Adam'
        ),
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
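
The Lambda layer above collapses each text's (MAX_LEN, embed_dim) embedding matrix into a single embed_dim vector by averaging over the time axis; a tiny shape check (illustrative only):

x = tf.random.normal([2, 5, 3])            # (batch, sequence length, embedding dim)
print(tf.reduce_mean(x, axis=1).shape)     # (2, 3) -- one averaged vector per example
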
%%time

tf.random.set_seed(33)  # available from TF 2.0

MODEL_DIR = os.path.join(LOGDIR, 'dnn')
shutil.rmtree(MODEL_DIR, ignore_errors=True)

BATCH_SIZE = 300
EPOCHS = 50
EMBED_DIM = 100
PATIENCE = 0

dnn_model = build_dnn_model(embed_dim=EMBED_DIM)

dnn_history = dnn_model.fit(
    X_train, y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_valid, y_valid),
    callbacks=[EarlyStopping(patience=PATIENCE), TensorBoard(MODEL_DIR)],
)

## in TF 2.0 the history keys are 'accuracy', 'val_accuracy', 'loss', 'val_loss'
pd.DataFrame(dnn_history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(dnn_history.history)[['accuracy', 'val_accuracy']].plot()

dnn_model.summary()
Train on 694 samples, validate on 78 samples
Epoch 1/50
694/694 [==============================] - 3s 4ms/sample - loss: 0.6887 - accuracy: 0.6556 - val_loss: 0.6799 - val_accuracy: 0.6923
Epoch 2/50
694/694 [==============================] - 2s 2ms/sample - loss: 0.6809 - accuracy: 0.6556 - val_loss: 0.6705 - val_accuracy: 0.6923
Epoch 3/50
694/694 [==============================] - 2s 3ms/sample - loss: 0.6736 - accuracy: 0.6556 - val_loss: 0.6617 - val_accuracy: 0.6923
Epoch 4/50
694/694 [==============================] - 2s 3ms/sample - loss: 0.6669 - accuracy: 0.6556 - val_loss: 0.6529 - val_accuracy: 0.6923
Epoch 5/50
694/694 [==============================] - 2s 3ms/sample - loss: 0.6604 - accuracy: 0.6556 - val_loss: 0.6448 - val_accuracy: 0.6923
Epoch 6/50
694/694 [==============================] - 2s 2ms/sample - loss: 0.6548 - accuracy: 0.6556 - val_loss: 0.6370 - val_accuracy: 0.6923
Epoch 7/50
694/694 [==============================] - 2s 3ms/sample - loss: 0.6496 - accuracy: 0.6556 - val_loss: 0.6298 - val_accuracy: 0.6923
Epoch 8/50
694/694 [==============================] - 2s 3ms/sample - loss: 0.6449 - accuracy: 0.6556 - val_loss: 0.6238 - val_accuracy: 0.6923
Epoch 9/50
694/694 [==============================] - 2s 3ms/sample - loss: 0.6427 - accuracy: 0.6556 - val_loss: 0.6190 - val_accuracy: 0.6923
Epoch 10/50
694/694 [==============================] - 2s 2ms/sample - loss: 0.6409 - accuracy: 0.6556 - val_loss: 0.6159 - val_accuracy: 0.6923
Epoch 11/50
694/694 [==============================] - 2s 2ms/sample - loss: 0.6399 - accuracy: 0.6556 - val_loss: 0.6140 - val_accuracy: 0.6923
Epoch 12/50
694/694 [==============================] - 2s 3ms/sample - loss: 0.6401 - accuracy: 0.6556 - val_loss: 0.6126 - val_accuracy: 0.6923
Epoch 13/50
694/694 [==============================] - 2s 3ms/sample - loss: 0.6402 - accuracy: 0.6556 - val_loss: 0.6120 - val_accuracy: 0.6923
Epoch 14/50
694/694 [==============================] - 2s 3ms/sample - loss: 0.6401 - accuracy: 0.6556 - val_loss: 0.6119 - val_accuracy: 0.6923
Epoch 15/50
694/694 [==============================] - 2s 3ms/sample - loss: 0.6396 - accuracy: 0.6556 - val_loss: 0.6119 - val_accuracy: 0.6923
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 12308, 100)        4110900   
_________________________________________________________________
lambda_1 (Lambda)            (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_4 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 202       
=================================================================
Total params: 4,131,302
Trainable params: 4,131,302
Non-trainable params: 0
_________________________________________________________________
CPU times: user 3min 24s, sys: 1min 8s, total: 4min 33s
Wall time: 28.2 s
 
 
dnn_history.history.keys()
 

 

dict_keys(['loss', 'val_loss', 'accuracy', 'val_accuracy'])
 

RNN

def build_rnn_model(embed_dim, units):

    model = Sequential([
        Embedding(VOCAB_SIZE + 1, embed_dim, input_shape=[MAX_LEN], mask_zero=True),  # mask_zero lets the GRU skip the padded 0 timesteps
        GRU(units), 
        Dense(N_CLASSES, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
%%time

tf.random.set_seed(33)

MODEL_DIR = os.path.join(LOGDIR, 'rnn')
shutil.rmtree(MODEL_DIR, ignore_errors=True)

EPOCHS = 15
BATCH_SIZE = 300
EMBED_DIM = 100
UNITS = 16
PATIENCE = 0

rnn_model = build_rnn_model(embed_dim=EMBED_DIM, units=UNITS)

history = rnn_model.fit(
    X_train, y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_valid, y_valid),
    callbacks=[EarlyStopping(patience=PATIENCE), TensorBoard(MODEL_DIR)],
    use_multiprocessing=True  ## set this to True when an out-of-memory (OOM) error occurs
)

pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()

rnn_model.summary()
 
Train on 694 samples, validate on 78 samples
Epoch 1/15
694/694 [==============================] - 415s 598ms/sample - loss: 0.6899 - accuracy: 0.5735 - val_loss: 0.6808 - val_accuracy: 0.6667
Epoch 2/15
694/694 [==============================] - 263s 378ms/sample - loss: 0.6704 - accuracy: 0.7738 - val_loss: 0.6702 - val_accuracy: 0.7308
Epoch 3/15
694/694 [==============================] - 296s 426ms/sample - loss: 0.6512 - accuracy: 0.8329 - val_loss: 0.6591 - val_accuracy: 0.7436
Epoch 4/15
694/694 [==============================] - 313s 451ms/sample - loss: 0.6299 - accuracy: 0.8660 - val_loss: 0.6471 - val_accuracy: 0.7308
Epoch 5/15
694/694 [==============================] - 292s 421ms/sample - loss: 0.6059 - accuracy: 0.8890 - val_loss: 0.6341 - val_accuracy: 0.7308
Epoch 6/15
694/694 [==============================] - 264s 380ms/sample - loss: 0.5790 - accuracy: 0.9308 - val_loss: 0.6202 - val_accuracy: 0.7692
Epoch 7/15
694/694 [==============================] - 314s 452ms/sample - loss: 0.5490 - accuracy: 0.9496 - val_loss: 0.6049 - val_accuracy: 0.7821
Epoch 8/15
694/694 [==============================] - 311s 449ms/sample - loss: 0.5154 - accuracy: 0.9582 - val_loss: 0.5888 - val_accuracy: 0.7949
Epoch 9/15
694/694 [==============================] - 343s 495ms/sample - loss: 0.4789 - accuracy: 0.9683 - val_loss: 0.5717 - val_accuracy: 0.8205
Epoch 10/15
694/694 [==============================] - 320s 461ms/sample - loss: 0.4397 - accuracy: 0.9769 - val_loss: 0.5540 - val_accuracy: 0.8205
Epoch 11/15
694/694 [==============================] - 290s 418ms/sample - loss: 0.3978 - accuracy: 0.9798 - val_loss: 0.5355 - val_accuracy: 0.8205
Epoch 12/15
694/694 [==============================] - 335s 483ms/sample - loss: 0.3550 - accuracy: 0.9856 - val_loss: 0.5170 - val_accuracy: 0.8205
Epoch 13/15
694/694 [==============================] - 324s 466ms/sample - loss: 0.3118 - accuracy: 0.9870 - val_loss: 0.4987 - val_accuracy: 0.8333
Epoch 14/15
694/694 [==============================] - 322s 464ms/sample - loss: 0.2699 - accuracy: 0.9870 - val_loss: 0.4814 - val_accuracy: 0.8333
Epoch 15/15
694/694 [==============================] - 328s 473ms/sample - loss: 0.2304 - accuracy: 0.9885 - val_loss: 0.4655 - val_accuracy: 0.8333
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 12308, 100)        4110900   
_________________________________________________________________
gru (GRU)                    (None, 16)                5664      
_________________________________________________________________
dense_6 (Dense)              (None, 2)                 34        
=================================================================
Total params: 4,116,598
Trainable params: 4,116,598
Non-trainable params: 0
_________________________________________________________________
CPU times: user 10h 1min 4s, sys: 9h 48min 35s, total: 19h 49min 40s
Wall time: 1h 18min 51s
 
 
 

CNN

def build_cnn_model(embed_dim, filters, ksize, strides):

    model = Sequential([
        Embedding(
            VOCAB_SIZE + 1,
            embed_dim,
            input_shape=[MAX_LEN],
            mask_zero=True),
        Conv1D( 
            filters=filters,
            kernel_size=ksize,
            strides=strides,
            activation='relu',
        ),
        Flatten(), 
        Dense(N_CLASSES, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
%%time

tf.random.set_seed(33) 

MODEL_DIR = os.path.join(LOGDIR, 'cnn')
shutil.rmtree(MODEL_DIR, ignore_errors=True)

EPOCHS = 50
BATCH_SIZE = 300
EMBED_DIM = 100
FILTERS = 200
STRIDES = 2
KSIZE = 3
PATIENCE = 0


cnn_model = build_cnn_model(
    embed_dim=EMBED_DIM,
    filters=FILTERS,
    strides=STRIDES,
    ksize=KSIZE,
)

cnn_history = cnn_model.fit(
    X_train, y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_valid, y_valid),
    callbacks=[EarlyStopping(patience=PATIENCE), TensorBoard(MODEL_DIR)],
)

pd.DataFrame(cnn_history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(cnn_history.history)[['accuracy', 'val_accuracy']].plot()

cnn_model.summary()
 
Train on 694 samples, validate on 78 samples
Epoch 1/50
694/694 [==============================] - 19s 28ms/sample - loss: 4.0082 - accuracy: 0.6571 - val_loss: 1.5037 - val_accuracy: 0.3077
Epoch 2/50
694/694 [==============================] - 17s 24ms/sample - loss: 1.3142 - accuracy: 0.3573 - val_loss: 0.5711 - val_accuracy: 0.7179
Epoch 3/50
694/694 [==============================] - 16s 24ms/sample - loss: 0.5577 - accuracy: 0.6902 - val_loss: 0.5591 - val_accuracy: 0.7179
Epoch 4/50
694/694 [==============================] - 17s 24ms/sample - loss: 0.5430 - accuracy: 0.6960 - val_loss: 0.5476 - val_accuracy: 0.8205
Epoch 5/50
694/694 [==============================] - 16s 23ms/sample - loss: 0.4928 - accuracy: 0.8516 - val_loss: 0.5210 - val_accuracy: 0.8590
Epoch 6/50
694/694 [==============================] - 14s 20ms/sample - loss: 0.4265 - accuracy: 0.8905 - val_loss: 0.4245 - val_accuracy: 0.8718
Epoch 7/50
694/694 [==============================] - 13s 19ms/sample - loss: 0.3282 - accuracy: 0.8804 - val_loss: 0.3535 - val_accuracy: 0.8718
Epoch 8/50
694/694 [==============================] - 16s 23ms/sample - loss: 0.2459 - accuracy: 0.9251 - val_loss: 0.3064 - val_accuracy: 0.8590
Epoch 9/50
694/694 [==============================] - 17s 24ms/sample - loss: 0.1719 - accuracy: 0.9424 - val_loss: 0.2806 - val_accuracy: 0.9103
Epoch 10/50
694/694 [==============================] - 16s 24ms/sample - loss: 0.1191 - accuracy: 0.9625 - val_loss: 0.2891 - val_accuracy: 0.8974
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 12308, 100)        4110900   
_________________________________________________________________
conv1d (Conv1D)              (None, 6153, 200)         60200     
_________________________________________________________________
flatten (Flatten)            (None, 1230600)           0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 2461202   
=================================================================
Total params: 6,632,302
Trainable params: 6,632,302
Non-trainable params: 0
_________________________________________________________________
CPU times: user 29min 52s, sys: 12min 9s, total: 42min 1s
Wall time: 2min 41s
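
A quick read of the summary above: with MAX_LEN = 12308, kernel_size = 3, and strides = 2, the Conv1D output length is (12308 - 3) // 2 + 1 = 6153 timesteps with 200 filters each, so Flatten produces 6153 × 200 = 1,230,600 features and the final Dense layer alone holds 1,230,600 × 2 + 2 = 2,461,202 parameters.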
 
 
 

3. Test

def convert_argmax(array):
    return np.argmax(array, axis=1)
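
A quick illustration of what convert_argmax does with the softmax outputs (made-up probabilities, not real predictions):

probs = np.array([[0.7, 0.3], [0.2, 0.8], [0.55, 0.45]])   # hypothetical model.predict output
print(convert_argmax(probs))   # [0 1 0]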
 

DNN

dnn_model_pred = dnn_model.predict(X_test)
target_names = ['불건전', '건전']  # 'inappropriate', 'clean'
print(confusion_matrix(convert_argmax(y_test), convert_argmax(dnn_model_pred)))
print(classification_report(convert_argmax(y_test), convert_argmax(dnn_model_pred), target_names=target_names))
 
[[  0  99]
 [  0 233]]
              precision    recall  f1-score   support

         불건전       0.00      0.00      0.00        99
          건전       0.70      1.00      0.82       233

    accuracy                           0.70       332
   macro avg       0.35      0.50      0.41       332
weighted avg       0.49      0.70      0.58       332

 
/usr/local/lib/python3.5/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
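
The warning above appears because the DNN predicted every test sample as 건전 (the majority class), as the confusion matrix shows, so precision for 불건전 is undefined and reported as 0.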
 

RNN

rnn_model_pred = rnn_model.predict(X_test)
target_names = ['불건전', '건전']
print(confusion_matrix(convert_argmax(y_test), convert_argmax(rnn_model_pred)))
print(classification_report(convert_argmax(y_test), convert_argmax(rnn_model_pred), target_names=target_names))
 
[[ 79  20]
 [ 16 217]]
              precision    recall  f1-score   support

         불건전       0.83      0.80      0.81        99
          건전       0.92      0.93      0.92       233

    accuracy                           0.89       332
   macro avg       0.87      0.86      0.87       332
weighted avg       0.89      0.89      0.89       332

 

CNN

cnn_model_pred = cnn_model.predict(X_test)
target_names = ['불건전', '건전']
print(confusion_matrix(convert_argmax(y_test), convert_argmax(cnn_model_pred)))
print(classification_report(convert_argmax(y_test), convert_argmax(cnn_model_pred), target_names=target_names))
 
[[ 87  12]
 [ 25 208]]
              precision    recall  f1-score   support

         불건전       0.78      0.88      0.82        99
          건전       0.95      0.89      0.92       233

    accuracy                           0.89       332
   macro avg       0.86      0.89      0.87       332
weighted avg       0.90      0.89      0.89       332

 

 

4. Results

  • Comparing recall on the 불건전 (inappropriate) class, performance ranked CNN > RNN > DNN; the DNN essentially failed to identify that class at all.

  • Training took about 28 seconds for the DNN, roughly 1 hour 19 minutes for the RNN, and about 2 minutes 41 seconds for the CNN, so training time ranks DNN < CNN < RNN.

  • Overall, the results could differ with hyperparameter tuning or higher-quality data, but based on this experiment the CNN appears to be the best fit.

 
