Text classification using GCP ai-platform

Machine Learning

Dan-k 2020. 6. 26. 20:07

We apply serverless machine learning using GCP ai-platform.

This post covers how to set things up to solve a text classification problem.

 

  

Custom config

- learning rate can be set --> a learning rate scheduler is applied

- optimizer options: ['Adadelta', 'Adagrad', 'Adamax', 'Nadam', 'RMSprop', 'SGD', 'Adam']

- model options: ['DNN', 'CNN', 'RNN', 'BiRNN', 'BiLSTM', 'BiLSTM_ATTENTION']

- other settings available: embedding dim, cell units, batch size, patience (early stopping), epochs, etc.

- when passing parameters, epochs, units, batch size, etc. must be given as int, not float (see the sketch after this list)
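
Because task.py declares these options with argparse type=int, passing a float such as 5.0 makes the job fail at argument parsing. A minimal standalone sketch of that behavior (not part of the trainer package):

import argparse

# Same pattern as the epoch/units/batch-size flags in task.py below.
parser = argparse.ArgumentParser()
parser.add_argument('--num-epochs', type=int, default=10)

print(parser.parse_args(['--num-epochs=5']).num_epochs)    # 5 -> OK
# parser.parse_args(['--num-epochs=5.0'])                  # error: invalid int value: '5.0'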

 

 

Model training

!gcloud ai-platform jobs submit training text_practice_model_20200626 \
        --job-dir gs://text/model/text_practice_model \
        --module-name trainer.task \
        --package-path ./trainer \
        --region us-central1 \
        --python-version 3.7 \
        --runtime-version 2.1 \
        --stream-logs \
        -- \
        --model-name='RNN' \
        --optimizer='Adam' \
        --learning-rate=0.001 \
        --embed-dim=32 \
        --n-classes=2 \
        --train-files=gs://text/movie_train.csv \
        --pred-files=gs://text/movie_predict.csv \
        --pred-sequence-files=gs://text/preprocess/movie_predict_sequence.json \
        --num-epochs=5 \
        --batch-size=128 

 

Registering a model version

# Create model
!gcloud ai-platform models create text_practice_model \
  --regions us-central1
  
# Create model version based on that SavedModel directory
!gcloud beta ai-platform versions create v1 \
    --model text_practice_model \
    --runtime-version 2.1 \
    --python-version 3.7 \
    --framework tensorflow \
    --origin gs://text/model/text_practice_model/keras_export

Online prediction

## Online predict (local)
!gcloud ai-platform predict \
  --model text_practice_model \
  --version v1 \
  --json-instances prediction_input.json
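
--json-instances expects newline-delimited JSON: one instance per line, which here means one padded sequence of word indices (length MAX_LEN) per line, the same format util.upload_pred_json_file_to_gcs writes. A minimal sketch of building prediction_input.json locally, assuming padded_sequences is the padded array produced inside util.load_data:

import json

# padded_sequences: 2-D array of shape (num_examples, MAX_LEN)
with open('prediction_input.json', 'w') as f:
    for row in padded_sequences[:10]:           # a few sample instances
        json.dump([int(v) for v in row], f)     # one JSON list of ints per line
        f.write('\n')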

Batch prediction

!gcloud ai-platform jobs submit prediction predict_text_model_pactice \
                 --model-dir 'gs://text/model/text_practice_model/keras_export' \
                 --runtime-version 2.1 \
                 --data-format text \
                 --region us-central1 \
                 --input-paths 'gs://text/preprocess/movie_predict_sequence.json' \
                 --output-path 'gs://text/predict/practice_output'


### Wait until the prediction job is done
!gcloud ai-platform jobs stream-logs predict_text_model_pactice

Notes on AI Platform prediction

1. Online prediction returns results even when the input dimensions do not match.

2. Batch prediction requires the dimensions to match exactly.

3. AI Platform prediction

- accepted prediction file formats: JSON, TFRecord, TFRecord (gzip)

- convert the file to JSON before running prediction

- to feed a CSV file into prediction it must be converted to TFRecord format; reading the CSV as text lines is the first step (see the sketch below)

  > dataset = tf.data.TextLineDataset(file_paths)
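
A minimal sketch of the tf.data pattern quoted above: read the CSV as text lines and parse each line into the contents/label columns from util.py (the path is a placeholder and both columns are assumed to be present):

import tensorflow as tf

file_paths = ['gs://text/movie_predict.csv']    # placeholder path
dataset = tf.data.TextLineDataset(file_paths)

def parse_line(line):
    # contents (string), label (int) -- matches _CSV_COLUMNS in util.py
    contents, label = tf.io.decode_csv(line, record_defaults=[[''], [0]])
    return contents, label

dataset = dataset.map(parse_line).batch(32)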

 

File structure

- trainer

    - __init__.py

    - task.py

    - model.py

    - util.py 

- setup.py

 


setup.py

from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = [
    'tensorflow==2.1.0',
    'tensorflow-model-analysis==0.15.0',
    'scipy>=1.1.0',
    'tensorboard==2.1.0',
    'soynlp>=0.0.493',
    'tensorflow-estimator==2.1.0',
    'gcsfs'
]

setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='Text classification trainer package.'
)

 

task.py

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import os

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from trainer import model
from trainer import util


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--job-dir',
        type=str,
        required=True,
        help='local or GCS location for writing checkpoints and exporting '
             'models')
    parser.add_argument(
        '--train-files',
        type=str,
        required=True)
    parser.add_argument(
        '--pred-files',
        type=str,
        required=True)
    parser.add_argument(
        '--pred-sequence-files',
        type=str,
        required=True)
    parser.add_argument(
        '--num-epochs',
        type=int,
        default=10,
        help='number of times to go through the data, default=10')
    parser.add_argument(
        '--batch-size',
        default=128,
        type=int,
        help='number of records to read during each training step, default=128')
    parser.add_argument(
        '--learning-rate',
        default=0.001,
        type=float,
        help='learning rate for gradient descent, default=0.001')
    parser.add_argument(
        '--units',
        default=16,
        type=int,
        help='the number of cells of model')
    parser.add_argument(
        '--embed-dim',
        default=32,
        type=int,
        help='embedded dimensional which you want to set')
    parser.add_argument(
        '--n-classes',
        default=2,
        type=int,
        help='the number of class to classify')
    parser.add_argument(
        '--model-name',
        choices=['DNN', 'CNN', 'RNN','BiRNN','BiLSTM','BiLSTM_ATTENTION'],
        type=str,
        required=True)
    parser.add_argument(
        '--optimizer',
        choices=['Adadelta', 'Adagrad', 'Adamax', 'Nadam', 'RMSprop', 'SGD', 'Adam'],
        default='Adam')
    parser.add_argument(
        '--patience',
        type=int,
        default=1)
    parser.add_argument(
        '--verbosity',
        choices=['DEBUG', 'ERROR', 'FATAL', 'INFO', 'WARN'],
        default='INFO')
    args, _ = parser.parse_known_args()
    return args


def train_and_evaluate(args):
    train_x, train_y, eval_x, eval_y, VOCAB_SIZE, MAX_LEN = util.load_data(args.train_files, args.pred_files,
                                                                           args.pred_sequence_files)

    # dimensions
    num_train_examples, input_dim = train_x.shape
    num_eval_examples = eval_x.shape[0]

    # Create the Keras Model
    optimizer = util.set_optimizer(name=args.optimizer, learning_rate=args.learning_rate)

    keras_model = model.create_keras_model(model_name=args.model_name,
                                           embed_dim=args.embed_dim, vocab_size=VOCAB_SIZE,
                                           max_len=MAX_LEN, units=args.units, n_classes=args.n_classes,
                                           optimizer=optimizer)

    # Pass a numpy array by passing DataFrame.values
    training_dataset = model.input_fn(
        features=train_x,
        labels=train_y,
        shuffle=True,
        num_epochs=args.num_epochs,
        batch_size=args.batch_size)

    # Pass a numpy array by passing DataFrame.values
    validation_dataset = model.input_fn(
        features=eval_x,
        labels=eval_y,
        shuffle=False,
        num_epochs=args.num_epochs,
        batch_size=num_eval_examples)

    # Setup Learning Rate decay.
    lr_decay_cb = tf.keras.callbacks.LearningRateScheduler(
        lambda epoch: args.learning_rate + 0.02 * (0.5 ** (1 + epoch)),
        verbose=True)

    # Setup TensorBoard callback.
    tensorboard_cb = tf.keras.callbacks.TensorBoard(
        os.path.join(args.job_dir, 'keras_tensorboard'),
        histogram_freq=1)

    # Train model
    keras_model.fit(
        training_dataset,
        steps_per_epoch=int(num_train_examples / args.batch_size),
        epochs=args.num_epochs,
        validation_data=validation_dataset,
        validation_steps=1,
        verbose=1,
        callbacks=[EarlyStopping(patience=args.patience), lr_decay_cb, tensorboard_cb])

    export_path = os.path.join(args.job_dir, 'keras_export')

    tf.compat.v1.keras.experimental.export_saved_model(keras_model, export_path)

    print('Model exported to: {}'.format(export_path))


if __name__ == '__main__':
    args = get_args()
    tf.compat.v1.logging.set_verbosity(args.verbosity)
    train_and_evaluate(args)
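
Note that the LearningRateScheduler lambda above adds a decaying term on top of the base learning rate rather than decaying the base rate itself. A quick sketch of the resulting schedule for the first few epochs with learning_rate=0.001:

learning_rate = 0.001

# Same formula as lr_decay_cb in task.py
for epoch in range(5):
    lr = learning_rate + 0.02 * (0.5 ** (1 + epoch))
    print(epoch, lr)    # approx. 0.011, 0.006, 0.0035, 0.00225, 0.0016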

 

model.py

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
from tensorflow.keras.layers import (
    Embedding,
    Flatten,
    Dense,
    SimpleRNN,
    GRU,
    LSTM,
    Conv1D,
    Lambda,
    Bidirectional,
    Concatenate,
    Layer,
    Dropout,
    BatchNormalization,
    Input
)
from tensorflow.keras.models import Model, Sequential  # Model and Input are used by build_bilstm_attention_model

print(tf.__version__)


def input_fn(features, labels, shuffle, num_epochs, batch_size):
    if labels is None:
        inputs = features
    else:
        inputs = (features, labels)
    dataset = tf.data.Dataset.from_tensor_slices(inputs)

    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(features))

    # We call repeat after shuffling, rather than before, to prevent separate
    # epochs from blending together.
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)

    return dataset


def build_dnn_model(vocab_size, embed_dim, max_len, n_classes, optimizer):
    model = Sequential([
        Embedding(vocab_size + 1, embed_dim, input_shape=[max_len]),
        Lambda(lambda x: tf.reduce_mean(x, axis=1)),
        Dense(100, activation='relu'),
        Dense(n_classes, activation='softmax')
    ])

    model.compile(
        optimizer=optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    return model


def build_rnn_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer):
    model = Sequential([
        Embedding(vocab_size + 1, embed_dim, input_shape=[max_len], mask_zero=True),
        GRU(units),
        Dense(n_classes, activation='softmax')
    ])

    model.compile(
        optimizer=optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    return model


def build_cnn_model(vocab_size, embed_dim, max_len, filters, ksize, strides, n_classes, optimizer):
    model = Sequential([
        Embedding(
            vocab_size + 1,
            embed_dim,
            input_shape=[max_len],
            mask_zero=True),
        Conv1D(
            filters=filters,
            kernel_size=ksize,
            strides=strides,
            activation='relu',
        ),
        Flatten(),
        Dense(n_classes, activation='softmax')
    ])

    model.compile(
        optimizer=optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    return model


def build_bisimpernn_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer):
    model = Sequential([
        Embedding(vocab_size + 1, embed_dim, input_shape=[max_len]),
        Bidirectional(SimpleRNN(units,
                                dropout=0.3)
                      ),
        Dense(n_classes, activation='softmax')
    ])

    model.compile(
        optimizer=optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    return model


def build_bilstm_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer):
    model = Sequential([
        Embedding(vocab_size + 1, embed_dim, input_shape=[max_len]),
        Bidirectional(LSTM(units,
                           dropout=0.3)
                      ),
        Dense(n_classes, activation='softmax')
    ])

    model.compile(
        optimizer=optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    return model


class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = Dense(units)
        self.W2 = Dense(units)
        self.V = Dense(1)

    def call(self, values, query):  # note: key and value are the same here
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        # the shape of the tensor before applying self.V is (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(hidden_with_time_axis)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights


def build_bilstm_attention_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer):
    sequence_input = Input(shape=(max_len,), dtype='int32')
    embedded_sequences = Embedding(vocab_size + 1, embed_dim, input_length=max_len)(sequence_input)
    lstm, forward_h, forward_c, backward_h, backward_c = Bidirectional(LSTM(units,
                                                                            dropout=0.3,
                                                                            return_sequences=True,
                                                                            return_state=True,
                                                                            recurrent_activation='relu',
                                                                            recurrent_initializer='glorot_uniform'
                                                                            )
                                                                       )(embedded_sequences)

    print(lstm.shape, forward_h.shape, forward_c.shape, backward_h.shape, backward_c.shape)

    state_h = Concatenate()([forward_h, backward_h])  # hidden state
    state_c = Concatenate()([forward_c, backward_c])  # cell state

    attention = BahdanauAttention(32)  # size of the attention weights
    context_vector, attention_weights = attention(lstm, state_h)  # decide whether to attend with the hidden state or the cell state

    hidden = BatchNormalization()(context_vector)
    dense1 = Dense(20, activation="relu")(hidden)
    dropout = Dropout(0.05)(dense1)
    output = Dense(n_classes, activation='softmax')(dropout)
    model = Model(inputs=sequence_input, outputs=output)

    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    return model


def create_keras_model(model_name, embed_dim, vocab_size, max_len, units, n_classes, optimizer,
                       ksize=None, strides=None, filters=None):
    if ksize is None:
        ksize, strides, filters = 3, 2, 200

    if model_name == 'DNN':
        return build_dnn_model(vocab_size, embed_dim, max_len, n_classes, optimizer)

    elif model_name == 'CNN':
        return build_cnn_model(vocab_size, embed_dim, max_len, filters, ksize, strides, n_classes, optimizer)

    elif model_name == 'RNN':
        return build_rnn_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer)

    elif model_name == 'BiRNN':
        return build_bisimpernn_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer)

    elif model_name == 'BiLSTM':
        return build_bilstm_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer)

    elif model_name == 'BiLSTM_ATTENTION':
        return build_bilstm_attention_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer)
    else:
        print('Define models')
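
A minimal usage sketch of the dispatcher above, combined with util.set_optimizer (assumes the trainer package is importable; vocabulary size and sequence length are placeholder values):

from trainer import model, util

# In task.py these values come from util.load_data and the CLI flags.
optimizer = util.set_optimizer(name='Adam', learning_rate=0.001)
keras_model = model.create_keras_model(model_name='RNN', embed_dim=32,
                                       vocab_size=20000, max_len=100,
                                       units=16, n_classes=2,
                                       optimizer=optimizer)
keras_model.summary()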

util.py

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import pandas as pd

from sklearn.model_selection import train_test_split
from soynlp.tokenizer import MaxScoreTokenizer
from soynlp.word import WordExtractor
from tensorflow.keras import optimizers
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

import json
from google.cloud import storage

_CSV_COLUMNS = [
    'contents', 'label'
]

_LABEL_COLUMN = 'label'

UNUSED_COLUMNS = ['pid', 'regdatetime']


def set_optimizer(name, learning_rate):
    if name == 'Adadelta':
        opt = optimizers.Adadelta(learning_rate=learning_rate)
    elif name == 'Adagrad':
        opt = optimizers.Adagrad(learning_rate=learning_rate)
    elif name == 'Adamax':
        opt = optimizers.Adamax(learning_rate=learning_rate)
    elif name == 'Nadam':
        opt = optimizers.Nadam(learning_rate=learning_rate)
    elif name == 'RMSprop':
        opt = optimizers.RMSprop(learning_rate=learning_rate)
    elif name == 'SGD':
        opt = optimizers.SGD(learning_rate=learning_rate)
    else:
        opt = optimizers.Adam(learning_rate=learning_rate)

    return opt


def get_tokenizer(data):
    corpus = data.contents.apply(str)
    word_extractor = WordExtractor()
    word_extractor.train(corpus)
    word_score = word_extractor.extract()
    scores = {word: score.cohesion_forward for word, score in word_score.items()}
    maxscore_tokenizer = MaxScoreTokenizer(scores=scores)

    return maxscore_tokenizer


def encode_labels(sources):
    classes = [source for source in sources]
    one_hots = to_categorical(classes)

    return one_hots


def to_gcs(file, gcs_path):
    pd.DataFrame(file).to_csv(gcs_path, header=False, index=False)
    print('completed saving files in gcs')


def get_df(file_path):
    df = pd.read_csv(file_path,
                     names=_CSV_COLUMNS,
                     encoding='utf-8')
    df = df.loc[:, 'contents':]
    df['contents'] = df['contents'].apply(str)

    return df


def get_gcs_args(pred_sequence_files):
    path_sep = pred_sequence_files.split('/')
    bucket_name = path_sep[2]
    destination_path = '/'.join(path_sep[3:])

    return bucket_name, destination_path


def upload_blob(bucket_name, source_file_name, destination_blob_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    blob.upload_from_filename(source_file_name)

    print(
        "File {} uploaded to {}".format(
            source_file_name, destination_blob_name
        )
    )


def upload_pred_json_file_to_gcs(pred_df, bucket_name, destination_path, samples=None):
    # pred_df = pd.read_csv(seq_files)
    source_file_name = 'temp_prediction_input.json'
    if samples is not None:
        prediction_input = pd.DataFrame(pred_df).sample(samples)
    else:
        prediction_input = pd.DataFrame(pred_df)

    with open(source_file_name, 'w') as json_file:
        for row in prediction_input.values.tolist():
            json.dump(row, json_file)
            json_file.write('\n')

    upload_blob(bucket_name, source_file_name=source_file_name, destination_blob_name=destination_path)


def load_data(training_file_path, pred_file_path, pred_sequence_files):
    train_df = get_df(training_file_path)
    test_df = get_df(pred_file_path)

    ### Not needed once preprocessing is unified; just re-set X and y here ###
    _tokenizer = get_tokenizer(train_df)

    def soynlp_morphs(contents):
        return ' '.join(_tokenizer.tokenize(contents))

    train_df['soynlp_morphs_contents'] = train_df['contents'].apply(soynlp_morphs)
    test_df['soynlp_morphs_contents'] = test_df['contents'].apply(soynlp_morphs)

    #####

    X = train_df.soynlp_morphs_contents
    y = train_df.label
    X_pred = test_df.soynlp_morphs_contents

    #     X = train_df.contents
    #     y = train_df.label
    #     X_pred = test_df.contents
    #################################################

    tokenizer = Tokenizer()  ## default split=' '
    tokenizer.fit_on_texts(X)
    sequences = tokenizer.texts_to_sequences(X)
    word_to_index = tokenizer.word_index
    VOCAB_SIZE = len(word_to_index) + 1
    MAX_LEN = max(len(seq) for seq in sequences)

    def create_sequences(texts, max_len):
        sequences = tokenizer.texts_to_sequences(texts)
        padded_sequences = pad_sequences(sequences, max_len, padding='post')

        return padded_sequences

    X_train, X_eval, y_train, y_eval = train_test_split(create_sequences(X, max_len=MAX_LEN), encode_labels(y),
                                                        test_size=0.3, random_state=42)

    #############################################################
    ### pred_file --> needs to be converted to a JSON file and passed on
    X_pred = create_sequences(X_pred, max_len=MAX_LEN)
    #     to_gcs(X_pred, pred_sequence_files)
    ### csv -> to json -> upload files to GCS
    bucket_name, destination_path = get_gcs_args(pred_sequence_files)
    upload_pred_json_file_to_gcs(X_pred, bucket_name, destination_path)
    print('upload predict json files')
    #############################################################

    ### if a y_pred label exists...
    try:
        y_pred = test_df.label
        y_pred = encode_labels(y_pred)
        to_gcs(y_pred, pred_sequence_files.replace('.csv', '_label.csv'))
    except Exception:
        print('y_pred does not exist')

    return X_train, y_train, X_eval, y_eval, VOCAB_SIZE, MAX_LEN

 
