Text classification using GCP ai-platform
We apply serverless machine learning using GCP ai-platform.
This post shows how to set it up for a text classification problem.
Custom config
- Learning rate is configurable --> a learning rate scheduler is applied
- Optimizer options: ['Adadelta', 'Adagrad', 'Adamax', 'Nadam', 'RMSprop', 'SGD', 'Adam']
- Model options: ['DNN', 'CNN', 'RNN', 'BiRNN', 'BiLSTM', 'BiLSTM_ATTENTION']
- Other settings such as embedding dim, cell units, batch size, patience (early stopping), and epochs are also configurable
- When passing parameters, epochs, units, batch size, etc. must be given as ints (floats are not accepted)
Model training
!gcloud ai-platform jobs submit training text_practice_model_20200626 \
--job-dir gs://text/model/text_practice_model \
--module-name trainer.task \
--package-path ./trainer \
--region us-central1 \
--python-version 3.7 \
--runtime-version 2.1 \
--stream-logs \
-- \
--model-name='RNN' \
--optimizer='Adam' \
--learning-rate=0.001 \
--embed-dim=32 \
--n-classes=2 \
--train-files=gs://text/movie_train.csv \
--pred-files=gs://text/movie_predict.csv \
--pred-sequence-files=gs://text/preprocess/movie_predict_sequence.json \
--num-epochs=5 \
--batch-size=128
Registering a model version
# Create model
!gcloud ai-platform models create text_practice_model \
--regions us-central1
# Create model version based on that SavedModel directory
!gcloud beta ai-platform versions create v1 \
--model text_practice_model \
--runtime-version 2.1 \
--python-version 3.7 \
--framework tensorflow \
--origin gs://text/model/text_practice_model/keras_export
Online prediction
## Online predict (local)
!gcloud ai-platform predict \
--model text_practice_model \
--version v1 \
--json-instances prediction_input.json
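Each line of prediction_input.json is one JSON instance: a padded integer sequence of length MAX_LEN, the same format that util.upload_pred_json_file_to_gcs (shown later) writes to GCS. A minimal sketch for building the file locally, assuming X_pred is the padded array produced in util.load_data:
import json

# Assumption: X_pred is the padded integer sequence array built in util.load_data
with open('prediction_input.json', 'w') as f:
    for row in X_pred[:10].tolist():  # a handful of instances is enough for online prediction
        json.dump(row, f)
        f.write('\n')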
Batch prediction
!gcloud ai-platform jobs submit prediction predict_text_model_pactice \
--model-dir 'gs://text/model/text_practice_model/keras_export' \
--runtime-version 2.1 \
--data-format text \
--region us-central1 \
--input-paths 'gs://text/preprocess/movie_predict_sequence.json' \
--output-path 'gs://text/predict/practice_output'
### Wait for the prediction job to finish
!gcloud ai-platform jobs stream-logs predict_text_model_pactice
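When the job finishes, the predictions land under --output-path as newline-delimited JSON shards (typically named prediction.results-*). A rough sketch for loading them back, assuming that default naming:
import json
import tensorflow as tf

output_path = 'gs://text/predict/practice_output'
results = []
for shard in tf.io.gfile.glob(output_path + '/prediction.results-*'):  # adjust the pattern if your output differs
    with tf.io.gfile.GFile(shard) as f:
        for line in f:
            results.append(json.loads(line))
print(len(results), results[:1])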
Things to watch out for when predicting on AI Platform
1. Online prediction returns results even when the input dimensions do not match the model.
2. Batch prediction requires the input dimensions to match exactly.
3. AI Platform prediction input formats
- Supported prediction file formats: JSON, TFRecord, gzipped TFRecord
- In this setup, the prediction file is converted to JSON before running prediction
- To feed a CSV file into prediction, it has to be converted to TFRecord format (see the sketch below)
> dataset = tf.data.TextLineDataset(file_paths)
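tf.data.TextLineDataset only reads the raw CSV lines; to actually produce TFRecords you would serialize each padded sequence as a tf.train.Example. A minimal sketch of that conversion (my own illustration, not part of the trainer package):
import tensorflow as tf

# Hypothetical helper: write padded integer sequences to a TFRecord file
def sequences_to_tfrecord(sequences, out_path):
    with tf.io.TFRecordWriter(out_path) as writer:
        for seq in sequences:
            example = tf.train.Example(features=tf.train.Features(feature={
                'sequence': tf.train.Feature(int64_list=tf.train.Int64List(value=list(seq)))
            }))
            writer.write(example.SerializeToString())

# sequences_to_tfrecord(X_pred, 'movie_predict.tfrecord')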
Package layout
- trainer
  - __init__.py
  - task.py
  - model.py
  - util.py
- setup.py
setup.py
from setuptools import find_packages
from setuptools import setup
REQUIRED_PACKAGES = [
'tensorflow==2.1.0',
'tensorflow-model-analysis==0.15.0',
'scipy>=1.1.0',
'tensorboard==2.1.0',
'soynlp>=0.0.493',
'tensorflow-estimator==2.1.0',
'gcsfs'
]
setup(
name='trainer',
version='0.1',
install_requires=REQUIRED_PACKAGES,
packages=find_packages(),
include_package_data=True,
description='text classification trainer package.'
)
task.py
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import os
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from trainer import model
from trainer import util
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument(
'--job-dir',
type=str,
required=True,
help='local or GCS location for writing checkpoints and exporting '
'models')
parser.add_argument(
'--train-files',
type=str,
required=True)
parser.add_argument(
'--pred-files',
type=str,
required=True)
parser.add_argument(
'--pred-sequence-files',
type=str,
required=True)
parser.add_argument(
'--num-epochs',
type=int,
default=10,
help='number of times to go through the data, default=10')
parser.add_argument(
'--batch-size',
default=128,
type=int,
help='number of records to read during each training step, default=128')
parser.add_argument(
'--learning-rate',
default=0.001,
type=float,
help='learning rate for gradient descent, default=0.001')
parser.add_argument(
'--units',
default=16,
type=int,
help='the number of cells of model')
parser.add_argument(
'--embed-dim',
default=32,
type=int,
help='embedded dimensional which you want to set')
parser.add_argument(
'--n-classes',
default=2,
type=int,
help='the number of class to classify')
parser.add_argument(
'--model-name',
choices=['DNN', 'CNN', 'RNN','BiRNN','BiLSTM','BiLSTM_ATTENTION'],
type=str,
required=True)
parser.add_argument(
'--optimizer',
choices=['Adadelta', 'Adagrad', 'Adamax', 'Nadam', 'RMSprop', 'SGD', 'Adam'],
default='Adam')
parser.add_argument(
'--patience',
type=int,
default=1)
parser.add_argument(
'--verbosity',
choices=['DEBUG', 'ERROR', 'FATAL', 'INFO', 'WARN'],
default='INFO')
args, _ = parser.parse_known_args()
return args
def train_and_evaluate(args):
train_x, train_y, eval_x, eval_y, VOCAB_SIZE, MAX_LEN = util.load_data(args.train_files, args.pred_files,
args.pred_sequence_files)
# dimensions
num_train_examples, input_dim = train_x.shape
num_eval_examples = eval_x.shape[0]
# Create the Keras Model
optimizer = util.set_optimizer(name=args.optimizer, learning_rate=args.learning_rate)
keras_model = model.create_keras_model(model_name=args.model_name,
embed_dim=args.embed_dim, vocab_size=VOCAB_SIZE,
max_len=MAX_LEN, units=args.units, n_classes=args.n_classes,
optimizer=optimizer)
# Pass a numpy array by passing DataFrame.values
training_dataset = model.input_fn(
features=train_x,
labels=train_y,
shuffle=True,
num_epochs=args.num_epochs,
batch_size=args.batch_size)
# Pass a numpy array by passing DataFrame.values
validation_dataset = model.input_fn(
features=eval_x,
labels=eval_y,
shuffle=False,
num_epochs=args.num_epochs,
batch_size=num_eval_examples)
# Setup Learning Rate decay.
lr_decay_cb = tf.keras.callbacks.LearningRateScheduler(
lambda epoch: args.learning_rate + 0.02 * (0.5 ** (1 + epoch)),
verbose=True)
# Setup TensorBoard callback.
tensorboard_cb = tf.keras.callbacks.TensorBoard(
os.path.join(args.job_dir, 'keras_tensorboard'),
histogram_freq=1)
# Train model
keras_model.fit(
training_dataset,
steps_per_epoch=int(num_train_examples / args.batch_size),
epochs=args.num_epochs,
validation_data=validation_dataset,
validation_steps=1,
verbose=1,
callbacks=[EarlyStopping(patience=args.patience), lr_decay_cb, tensorboard_cb])
export_path = os.path.join(args.job_dir, 'keras_export')
tf.compat.v1.keras.experimental.export_saved_model(keras_model, export_path)
print('Model exported to: {}'.format(export_path))
if __name__ == '__main__':
args = get_args()
tf.compat.v1.logging.set_verbosity(args.verbosity)
train_and_evaluate(args)
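One detail worth noting in task.py: the LearningRateScheduler does not decay the base rate itself; it adds 0.02 * 0.5^(1+epoch) on top of --learning-rate, so the effective rate decays toward the base value. A quick check with the default 0.001:
# Effective learning rate per epoch for the scheduler used in train_and_evaluate
learning_rate = 0.001
for epoch in range(5):
    print(epoch, learning_rate + 0.02 * (0.5 ** (1 + epoch)))
# roughly: 0.011, 0.006, 0.0035, 0.00225, 0.001625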
model.py
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from tensorflow.keras.layers import (
Embedding,
Flatten,
Dense,
SimpleRNN,
GRU,
LSTM,
Conv1D,
Lambda,
Bidirectional,
Concatenate,
Layer,
Dropout,
BatchNormalization
)
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input  # Input and Model are needed by build_bilstm_attention_model
print(tf.__version__)
def input_fn(features, labels, shuffle, num_epochs, batch_size):
if labels is None:
inputs = features
else:
inputs = (features, labels)
dataset = tf.data.Dataset.from_tensor_slices(inputs)
if shuffle:
dataset = dataset.shuffle(buffer_size=len(features))
# We call repeat after shuffling, rather than before, to prevent separate
# epochs from blending together.
dataset = dataset.repeat(num_epochs)
dataset = dataset.batch(batch_size)
return dataset
def build_dnn_model(vocab_size, embed_dim, max_len, n_classes, optimizer):
model = Sequential([
Embedding(vocab_size + 1, embed_dim, input_shape=[max_len]),
Lambda(lambda x: tf.reduce_mean(x, axis=1)),
Dense(100, activation='relu'),
Dense(n_classes, activation='softmax')
])
model.compile(
optimizer=optimizer,
loss='categorical_crossentropy',
metrics=['accuracy']
)
return model
def build_rnn_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer):
model = Sequential([
Embedding(vocab_size + 1, embed_dim, input_shape=[max_len], mask_zero=True),
GRU(units),
Dense(n_classes, activation='softmax')
])
model.compile(
optimizer=optimizer,
loss='categorical_crossentropy',
metrics=['accuracy']
)
return model
def build_cnn_model(vocab_size, embed_dim, max_len, filters, ksize, strides, n_classes, optimizer):
model = Sequential([
Embedding(
vocab_size + 1,
embed_dim,
input_shape=[max_len],
mask_zero=True),
Conv1D(
filters=filters,
kernel_size=ksize,
strides=strides,
activation='relu',
),
Flatten(),
Dense(n_classes, activation='softmax')
])
model.compile(
optimizer=optimizer,
loss='categorical_crossentropy',
metrics=['accuracy']
)
return model
def build_bisimpernn_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer):
model = Sequential([
Embedding(vocab_size + 1, embed_dim, input_shape=[max_len]),
Bidirectional(SimpleRNN(units,
dropout=0.3)
),
Dense(n_classes, activation='softmax')
])
model.compile(
optimizer=optimizer,
loss='categorical_crossentropy',
metrics=['accuracy']
)
return model
def build_bilstm_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer):
model = Sequential([
Embedding(vocab_size + 1, embed_dim, input_shape=[max_len]),
Bidirectional(LSTM(units,
dropout=0.3)
),
Dense(n_classes, activation='softmax')
])
model.compile(
optimizer=optimizer,
loss='categorical_crossentropy',
metrics=['accuracy']
)
return model
class BahdanauAttention(tf.keras.Model):
def __init__(self, units):
super(BahdanauAttention, self).__init__()
self.W1 = Dense(units)
self.W2 = Dense(units)
self.V = Dense(1)
def call(self, values, query):  # note: key and value are the same here
# hidden shape == (batch_size, hidden size)
# hidden_with_time_axis shape == (batch_size, 1, hidden size)
# we are doing this to perform addition to calculate the score
hidden_with_time_axis = tf.expand_dims(query, 1)
# score shape == (batch_size, max_length, 1)
# we get 1 at the last axis because we are applying score to self.V
# the shape of the tensor before applying self.V is (batch_size, max_length, units)
score = self.V(tf.nn.tanh(
self.W1(values) + self.W2(hidden_with_time_axis)))
# attention_weights shape == (batch_size, max_length, 1)
attention_weights = tf.nn.softmax(score, axis=1)
# context_vector shape after sum == (batch_size, hidden_size)
context_vector = attention_weights * values
context_vector = tf.reduce_sum(context_vector, axis=1)
return context_vector, attention_weights
def build_bilstm_attention_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer):
sequence_input = Input(shape=(max_len,), dtype='int32')
embedded_sequences = Embedding(vocab_size + 1, embed_dim, input_length=max_len)(sequence_input)
lstm, forward_h, forward_c, backward_h, backward_c = Bidirectional(LSTM(units,
dropout=0.3,
return_sequences=True,
return_state=True,
recurrent_activation='relu',
recurrent_initializer='glorot_uniform'
)
)(embedded_sequences)
print(lstm.shape, forward_h.shape, forward_c.shape, backward_h.shape, backward_c.shape)
state_h = Concatenate()([forward_h, backward_h])  # hidden state
state_c = Concatenate()([forward_c, backward_c])  # cell state
attention = BahdanauAttention(32)  # size of the attention weights
context_vector, attention_weights = attention(lstm, state_h)  # attend over the hidden state (the cell state could be used instead)
hidden = BatchNormalization()(context_vector)
dense1 = Dense(20, activation="relu")(hidden)
dropout = Dropout(0.05)(dense1)
output = Dense(n_classes, activation='softmax')(dropout)
model = Model(inputs=sequence_input, outputs=output)
model.compile(optimizer=optimizer,
loss='categorical_crossentropy',
metrics=['accuracy'])
return model
def create_keras_model(model_name, embed_dim, vocab_size, max_len, units, n_classes, optimizer,
ksize=None, strides=None, filters=None):
if ksize is None:
ksize, strides, filters = 3, 2, 200
if model_name == 'DNN':
return build_dnn_model(vocab_size, embed_dim, max_len, n_classes, optimizer)
elif model_name == 'CNN':
return build_cnn_model(vocab_size, embed_dim, max_len, filters, ksize, strides, n_classes, optimizer)
elif model_name == 'RNN':
return build_rnn_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer)
elif model_name == 'BiRNN':
return build_bisimpernn_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer)
elif model_name == 'BiLSTM':
return build_bilstm_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer)
elif model_name == 'BiLSTM_ATTENTION':
return build_bilstm_attention_model(vocab_size, embed_dim, max_len, units, n_classes, optimizer)
else:
raise ValueError('Unknown model name: {}'.format(model_name))
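To sanity-check that a chosen architecture builds and compiles, something like the following should work locally (toy vocab_size and max_len values, picked only for illustration):
import tensorflow as tf
from trainer import model

# Toy hyperparameters, just to confirm the model factory returns a compiled model
keras_model = model.create_keras_model(
    model_name='BiLSTM', embed_dim=32, vocab_size=5000, max_len=100,
    units=16, n_classes=2,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
keras_model.summary()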
util.py
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import pandas as pd
from sklearn.model_selection import train_test_split
from soynlp.tokenizer import MaxScoreTokenizer
from soynlp.word import WordExtractor
from tensorflow.keras import optimizers
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
import json
from google.cloud import storage
_CSV_COLUMNS = [
'contents', 'label'
]
_LABEL_COLUMN = 'label'
UNUSED_COLUMNS = ['pid', 'regdatetime']
def set_optimizer(name, learning_rate):
if name == 'Adadelta':
opt = optimizers.Adadelta(learning_rate=learning_rate)
elif name == 'Adagrad':
opt = optimizers.Adagrad(learning_rate=learning_rate)
elif name == 'Adamax':
opt = optimizers.Adamax(learning_rate=learning_rate)
elif name == 'Nadam':
opt = optimizers.Nadam(learning_rate=learning_rate)
elif name == 'RMSprop':
opt = optimizers.RMSprop(learning_rate=learning_rate)
elif name == 'SGD':
opt = optimizers.SGD(learning_rate=learning_rate)
else:
opt = optimizers.Adam(learning_rate=learning_rate)
return opt
def get_tokenizer(data):
corpus = data.contents.apply(str)
word_extractor = WordExtractor()
word_extractor.train(corpus)
word_score = word_extractor.extract()
scores = {word: score.cohesion_forward for word, score in word_score.items()}
maxscore_tokenizer = MaxScoreTokenizer(scores=scores)
return maxscore_tokenizer
def encode_labels(sources):
classes = [source for source in sources]
one_hots = to_categorical(classes)
return one_hots
def to_gcs(file, gcs_path):
pd.DataFrame(file).to_csv(gcs_path, header=False, index=False)
print('completed saving files in gcs')
def get_df(file_path):
df = pd.read_csv(file_path,
names=_CSV_COLUMNS,
encoding='utf-8')
df = df.loc[:, 'contents':]
df['contents'] = df['contents'].apply(str)
return df
def get_gcs_args(pred_sequence_files):
path_sep = pred_sequence_files.split('/')
bucket_name = path_sep[2]
destination_path = '/'.join(path_sep[3:])
return bucket_name, destination_path
def upload_blob(bucket_name, source_file_name, destination_blob_name):
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(source_file_name)
print(
"File {} uploaded to {}".format(
source_file_name, destination_blob_name
)
)
def upload_pred_json_file_to_gcs(pred_df, bucket_name, destination_path, samples=None):
# pred_df = pd.read_csv(seq_files)
source_file_name = 'temp_prediction_input.json'
if samples is not None:
prediction_input = pd.DataFrame(pred_df).sample(samples)
else:
prediction_input = pd.DataFrame(pred_df)
with open(source_file_name, 'w') as json_file:
for row in prediction_input.values.tolist():
json.dump(row, json_file)
json_file.write('\n')
upload_blob(bucket_name, source_file_name=source_file_name, destination_blob_name=destination_path)
def load_data(training_file_path, pred_file_path, pred_sequence_files):
train_df = get_df(training_file_path)
test_df = get_df(pred_file_path)
### Not needed once preprocessing is handled upstream; only x and y need to be set here ###
_tokenizer = get_tokenizer(train_df)
def soynlp_morphs(contents):
return ' '.join(_tokenizer.tokenize(contents))
train_df['soynlp_morphs_contents'] = train_df['contents'].apply(soynlp_morphs)
test_df['soynlp_morphs_contents'] = test_df['contents'].apply(soynlp_morphs)
#####
X = train_df.soynlp_morphs_contents
y = train_df.label
X_pred = test_df.soynlp_morphs_contents
# X = train_df.contents
# y = train_df.label
# X_pred = test_df.contents
#################################################
tokenizer = Tokenizer() ## default split=' '
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
word_to_index = tokenizer.word_index
VOCAB_SIZE = len(word_to_index) + 1
MAX_LEN = max(len(seq) for seq in sequences)
def create_sequences(texts, max_len):
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, max_len, padding='post')
return padded_sequences
X_train, X_eval, y_train, y_eval = train_test_split(create_sequences(X, max_len=MAX_LEN), encode_labels(y),
test_size=0.3, random_state=42)
#############################################################
### pred_file --> needs to be converted to a JSON file and handed over
X_pred = create_sequences(X_pred, max_len=MAX_LEN)
# to_gcs(X_pred, pred_sequence_files)
### csv -> to json -> upload files to GCS
bucket_name, destination_path = get_gcs_args(pred_sequence_files)
upload_pred_json_file_to_gcs(X_pred, bucket_name, destination_path)
print('upload predict json files')
#############################################################
### if a y_pred label exists...
try:
y_pred = test_df.label
y_pred = encode_labels(y_pred)
to_gcs(y_pred, pred_sequence_files.replace('.csv', '_label.csv'))
except Exception:
print('y_pred does not exist')
return X_train, y_train, X_eval, y_eval, VOCAB_SIZE, MAX_LEN
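As a reference for what load_data ultimately feeds the model, here is a tiny standalone version of the Tokenizer + pad_sequences step (toy English sentences, soynlp tokenization omitted):
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ['what a great movie', 'boring and way too long']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)  # lists of word indices
padded = pad_sequences(sequences, maxlen=5, padding='post')
print(padded.shape)  # (2, 5) -- this padded integer matrix is what the models consume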