Emma knocked on the door. No answer. She knocked again and waited. There was a large maple tree next to the house. Emma looked up the tree and saw a giant raven perched at the treetop. Under the afternoon sun, the raven gleamed magnificently. Its beak was hard and pointed, its claws sharp and strong. It looked regal and imposing. It reigned the tree it stood on. The raven was looking straight at Emma with its beady black eye. Emma felt slightly intimidated. She took a step back from the door and tentatively said, "Hello?"
Important words such as 'magnificently', 'gleamed', 'intimidated', 'tentatively', and 'reigned' are hard to pick out with BoW alone, because each of them appears only once

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

sentence = 'Emma knocked on the door. No answer. She knocked again and waited. There was a large maple tree next to the house. Emma looked up the tree and saw a giant raven perched at the treetop. Under the afternoon sun, the raven gleamed magnificently. Its beak was hard and pointed, its claws sharp and strong. It looked regal and imposing. It reigned the tree it stood on. The raven was looking straight at Emma with its beady black eye. Emma felt slightly intimidated. She took a step back from the door and tentatively said, "Hello?"'
bow_converter = CountVectorizer()
X = bow_converter.fit_transform([sentence])
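# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 onward
features = bow_converter.get_feature_names_out()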
df = pd.DataFrame()
df['features'] = features
df['cnt'] = X.toarray()[0]
df[df['cnt']>2]

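tf-idf: A Simple Twist on BoW

Emphasizes meaningful words
Term frequency - inverse document frequency
bow(w, d) = number of times word w appears in document d
tf-idf(w, d) = bow(w, d) × log(N / (number of documents in which word w appears)), where N is the total number of documents
- A word scores high when it appears often in a document but rarely in other documents (rare words stand out, common words are effectively ignored)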
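
The definition above can be checked by hand on a tiny corpus. The following is a minimal sketch in plain Python that applies the formula exactly as written; the four example documents and the helper names (`bow`, `doc_freq`, `tf_idf`) are made up for illustration. Note that scikit-learn's `TfidfTransformer` uses a smoothed variant of the idf term by default, so its numbers will differ from these.

```python
import math

# Hypothetical toy corpus (not taken from the Yelp data)
docs = [
    "the raven gleamed",
    "the raven sat on the tree",
    "the door was closed",
    "the tree was tall",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # total number of documents

def bow(word, doc_tokens):
    """Number of times `word` appears in one document."""
    return doc_tokens.count(word)

def doc_freq(word):
    """Number of documents in which `word` appears at least once."""
    return sum(1 for tokens in tokenized if word in tokens)

def tf_idf(word, doc_tokens):
    """bow(w, d) * log(N / document frequency of w)."""
    return bow(word, doc_tokens) * math.log(N / doc_freq(word))

# Score every word of the first document
first = tokenized[0]
scores = {w: tf_idf(w, first) for w in first}
for word, score in scores.items():
    print(f"{word}: {score:.3f}")
```

'the' appears in every document, so its score is log(4/4) = 0, while 'gleamed' appears in only one document and gets the highest score.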

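Putting tf-idf to the Test

tf-idf transforms each word-count feature by multiplying it by a constant (the word's inverse document frequency) => an example of feature scaling
Compare classification performance with scaled and unscaled features (text classification)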

import json
import pandas as pd

# Load the Yelp business data (one JSON object per line)
with open('yelp_academic_dataset_business.json', encoding='UTF8') as biz_f:
    biz_df = pd.DataFrame([json.loads(x) for x in biz_f])

# Load the Yelp review data
with open('review.json', encoding='UTF8') as review_file:
    review_df = pd.DataFrame([json.loads(x) for x in review_file])

biz_df.head()

review_df.head()

# Keep only Nightlife or Restaurants businesses
# First drop rows whose categories value is missing (None)
biz_df = biz_df.dropna(subset=['categories'])

two_biz = biz_df[biz_df.apply(lambda x: 'Nightlife' in x['categories'] or 'Restaurants' in x['categories'], axis=1)]

# Merge the business data and the review data into one DataFrame
twobiz_reviews = two_biz.merge(review_df, on='business_id', how='inner')
twobiz_reviews.head()

# Keep only the features we use
twobiz_reviews = twobiz_reviews[['business_id',
                                 'name',
                                 'stars_y',
                                 'text',
                                 'categories']]

# Create a target column: True for Nightlife businesses, False otherwise
twobiz_reviews['target'] = twobiz_reviews.apply(lambda x: 'Nightlife' in x['categories'], axis=1)

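Classifying Reviews as Restaurant or Nightlife from the Review Text

Imbalanced dataset (the review counts of the two categories differ greatly) => the model spends most of its effort fitting the larger class
Here, the larger class is downsampled to roughly the same size as the smaller class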
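
As an illustration of the idea, here is a minimal sketch of downsampling in plain Python. It samples the larger class down to exactly the size of the smaller one, which is a slight variant of the fixed sampling fractions used in this post; the function name `downsample_to_balance` and the toy review lists are hypothetical.

```python
import random

def downsample_to_balance(majority, minority, seed=123):
    """Randomly keep as many majority items as there are minority items."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = rng.sample(majority, k=len(minority))
    return kept + list(minority)

# Hypothetical toy data: 30 restaurant reviews vs. 10 nightlife reviews
restaurant_reviews = [f"restaurant review {i}" for i in range(30)]
nightlife_reviews = [f"nightlife review {i}" for i in range(10)]

balanced = downsample_to_balance(restaurant_reviews, nightlife_reviews)
n_restaurant = sum(1 for r in balanced if r.startswith("restaurant"))
n_nightlife = sum(1 for r in balanced if r.startswith("nightlife"))
print(n_restaurant, n_nightlife)
```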

print('Number of restaurant: ' + str(len(twobiz_reviews[twobiz_reviews['target']==False])))
print('Number of nightlife: ' + str(len(twobiz_reviews[twobiz_reviews['target']==True])))

Number of restaurant: 3192186
Number of nightlife: 1202166

# Build a class-balanced sample to use for modeling
nightlife = twobiz_reviews[twobiz_reviews.apply(lambda x: 'Nightlife' in x['categories'], axis=1)]
restaurants = twobiz_reviews[twobiz_reviews.apply(lambda x: 'Restaurants' in x['categories'], axis=1)]

nightlife_subset = nightlife.sample(frac=0.1, random_state=123)
restaurant_subset = restaurants.sample(frac=0.03, random_state=123)
combined = pd.concat([nightlife_subset, restaurant_subset])

# Split into a training set and a test set
from sklearn.model_selection import train_test_split
training_data, test_data = train_test_split(combined, train_size=0.7, test_size=0.3, random_state=123)

print(training_data.shape)
print(test_data.shape)

(172387, 6)
(73881, 6)

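BoW, tf-idf, and l² Normalization for Linear Classification

Scaling learned on the training set does not map cleanly onto the test set
Some data goes missing (words that appear only in the test set)
- Usually, new words are simply dropped
- Alternatively, create a 'garbage' word, learn it explicitly on the training set, and map those test-set words to it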
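
The second option can be sketched as follows. This is only an illustration in plain Python, under the assumption that rare training words (below a hypothetical `min_count` threshold) are also mapped to the garbage token so the model sees it during training; the token string `<garbage>` and the helper names are made up and are not part of scikit-learn.

```python
from collections import Counter

GARBAGE = "<garbage>"  # hypothetical placeholder token

def build_vocab(train_docs, min_count=2):
    """Keep words seen at least `min_count` times in training, plus the garbage token."""
    counts = Counter(w for doc in train_docs for w in doc.split())
    vocab = {w for w, c in counts.items() if c >= min_count}
    vocab.add(GARBAGE)
    return vocab

def map_to_vocab(doc, vocab):
    """Replace every out-of-vocabulary word with the garbage token."""
    return [w if w in vocab else GARBAGE for w in doc.split()]

train_docs = ["great beer great music", "great pizza", "music and beer"]
vocab = build_vocab(train_docs)

# Rare training words become garbage, so the token is learned explicitly
train_tokens = [map_to_vocab(d, vocab) for d in train_docs]
# Words unseen in training ('karaoke') land on the same token at test time
test_tokens = map_to_vocab("great karaoke and beer", vocab)
print(test_tokens)
```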

# Convert the review text to a BoW representation
from sklearn.feature_extraction import text

bow_transform = text.CountVectorizer()
X_tr_bow = bow_transform.fit_transform(training_data['text'])
X_te_bow = bow_transform.transform(test_data['text'])
len(bow_transform.vocabulary_)

95654

y_tr = training_data['target']
y_te = test_data['target']

# Create the tf-idf representation from the BoW matrix
tfidf_trfm = text.TfidfTransformer(norm=None)
X_tr_tfidf = tfidf_trfm.fit_transform(X_tr_bow)
X_te_tfidf = tfidf_trfm.transform(X_te_bow)

# For comparison, l2-normalize the BoW representation (axis=0 normalizes each feature column)
import sklearn.preprocessing as preproc
X_tr_l2 = preproc.normalize(X_tr_bow, axis=0)
X_te_l2 = preproc.normalize(X_te_bow, axis=0)

Logistic Regression

Sigmoid function: 1 / (1 + e^{-(w^T x + b)}) (range: 0 to 1)
Output < 0.5 => False, output > 0.5 => True
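
The decision rule above can be written out directly. The following is a minimal sketch in plain Python; the weights `w`, bias `b`, and input vectors are hypothetical values chosen for illustration, and the function is not hardened against overflow for very large negative inputs.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b):
    """Return the sigmoid output for w^T x + b and the thresholded label."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    return p, p > 0.5

# Hypothetical weights and bias
w = [0.8, -0.4]
b = -0.1

p_pos, label_pos = predict(w, [2.0, 1.0], b)  # w^T x + b = 1.1
p_neg, label_neg = predict(w, [0.0, 3.0], b)  # w^T x + b = -1.3
print(round(p_pos, 3), label_pos)
print(round(p_neg, 3), label_neg)
```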

from sklearn.linear_model import LogisticRegression

def simple_logistic_classify(X_tr, y_tr, X_test, y_test, description, _C=1.0):
    m = LogisticRegression(C=_C).fit(X_tr, y_tr)
    s = m.score(X_test, y_test)
    print('Test score with', description, 'features:', s)
    return m

m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow')
m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized')
m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf')

Test score with bow features: 0.7674909652007958
Test score with l2-normalized features: 0.7361567926801207
Test score with tf-idf features: 0.7424100919045492

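Tuning Logistic Regression with Regularization

When there are more features than data points, the problem of finding the best model is underdetermined, so the model can easily overfit => addressed with regularization
Hyperparameter tuning -> grid search -> in deep learning, random search is often used instead
K-fold cross validation
GridSearchCV -> runs k-fold cross validation and grid search together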
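
To make the combination concrete, here is a minimal sketch of what "grid search with k-fold cross validation" does, written in plain Python. It uses simple contiguous folds and is not how scikit-learn implements `GridSearchCV` internally; `k_fold_indices`, `grid_search`, and `toy_fit_and_score` are hypothetical names, and the toy scorer ignores the fold indices and merely stands in for "train on the training folds with this C, return accuracy on the held-out fold".

```python
import math

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def grid_search(param_values, n_samples, k, fit_and_score):
    """Average the validation score over k folds for each candidate value."""
    results = {}
    for value in param_values:
        scores = [fit_and_score(value, tr, va)
                  for tr, va in k_fold_indices(n_samples, k)]
        results[value] = sum(scores) / len(scores)
    best = max(results, key=results.get)
    return best, results

def toy_fit_and_score(C, train_idx, val_idx):
    """Stand-in scorer (assumption): pretends accuracy peaks at C = 0.1."""
    return 1.0 - abs(math.log10(C) + 1) / 10

grid = [1e-5, 1e-3, 1e-1, 1e0, 1e1, 1e2]
best_C, mean_scores = grid_search(grid, n_samples=10, k=5,
                                  fit_and_score=toy_fit_and_score)
folds = list(k_fold_indices(10, 5))
print(best_C)
```

Every data point is held out exactly once, and the value of C with the best mean validation score is selected.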

# Tune the logistic regression hyperparameter with grid search
import sklearn.model_selection as modsel

# Specify the grid to search; a 5-fold grid search is run for each feature set
param_grid_ = {'C': [1e-5, 1e-3, 1e-1, 1e0, 1e1, 1e2]}

# Tune the classifier for the BoW representation
bow_search = modsel.GridSearchCV(LogisticRegression(), cv=5, 
                               param_grid=param_grid_)
bow_search.fit(X_tr_bow, y_tr)

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

# Tune the classifier for the l2-normalized word vectors
l2_search = modsel.GridSearchCV(LogisticRegression(), cv=5,
                               param_grid=param_grid_)
l2_search.fit(X_tr_l2, y_tr)

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

# Tune the classifier for the tf-idf representation
tfidf_search = modsel.GridSearchCV(LogisticRegression(), cv=5,
                                  param_grid=param_grid_)
tfidf_search.fit(X_tr_tfidf, y_tr)

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

# Show the BoW search results
bow_search.cv_results_

{'mean_fit_time': array([  1.41952262,   4.24020009,  34.77255497, 111.57552967,
        218.28999233, 191.94038186]),
 'std_fit_time': array([1.53911039e-02, 1.55586522e-01, 1.33868671e+00, 9.82034884e+00,
        3.98804937e+01, 1.92471981e+01]),
 'mean_score_time': array([0.01456137, 0.01495895, 0.01506171, 0.01695457, 0.01576467,
        0.01476154]),
 'std_score_time': array([0.00101827, 0.00109471, 0.0011967 , 0.00109267, 0.00115247,
        0.001323  ]),
 'param_C': masked_array(data=[1e-05, 0.001, 0.1, 1.0, 10.0, 100.0],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 1e-05},
  {'C': 0.001},
  {'C': 0.1},
  {'C': 1.0},
  {'C': 10.0},
  {'C': 100.0}],
 'split0_test_score': array([0.61526191, 0.7456349 , 0.76732989, 0.76135507, 0.75076861,
        0.74702709]),
 'split1_test_score': array([0.61555195, 0.7449098 , 0.76947619, 0.76138407, 0.74676605,
        0.74862231]),
 'split2_test_score': array([0.61536677, 0.74440932, 0.76714911, 0.75931781, 0.74472837,
        0.74029063]),
 'split3_test_score': array([0.61501871, 0.74806393, 0.77152885, 0.76604693, 0.75293674,
        0.75157351]),
 'split4_test_score': array([0.61548279, 0.74577254, 0.76816428, 0.76062302, 0.74983322,
        0.7467297 ]),
 'mean_test_score': array([0.61533642, 0.74575809, 0.76872966, 0.76174538, 0.7490066 ,
        0.74684866]),
 'std_test_score': array([0.00018724, 0.00125467, 0.00162295, 0.00227766, 0.00291743,
        0.00370198]),
 'rank_test_score': array([6, 5, 1, 2, 3, 4]),
 'split0_train_score': array([0.61508676, 0.75125626, 0.81600186, 0.86204671, 0.89427086,
        0.89890435]),
 'split1_train_score': array([0.61608742, 0.75149555, 0.8147329 , 0.86211922, 0.89750488,
        0.89242907]),
 'split2_train_score': array([0.61524908, 0.75194692, 0.81548836, 0.86161265, 0.90075412,
        0.90417664]),
 'split3_train_score': array([0.61519107, 0.75136683, 0.81409615, 0.85995214, 0.89242259,
        0.891828  ]),
 'split4_train_score': array([0.61515481, 0.75185266, 0.81578566, 0.86180117, 0.89430788,
        0.89792618]),
 'mean_train_score': array([0.61535383, 0.75158365, 0.81522099, 0.86150638, 0.89585207,
        0.89705285]),
 'std_train_score': array([0.00037055, 0.00027066, 0.00070761, 0.00079767, 0.00294645,
        0.00455231])}

# Plot the cross-validation results as a box plot to visualize and compare classifier performance
search_results = pd.DataFrame.from_dict({
    'bow': bow_search.cv_results_['mean_test_score'],
    'tfidf': tfidf_search.cv_results_['mean_test_score'],
    'l2': l2_search.cv_results_['mean_test_score']
})

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
ax = sns.boxplot(data=search_results, width=0.4)
ax.set_ylabel('Accuracy', size=14)
ax.tick_params(labelsize=14)

# Train the models using the best hyperparameters
m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow',
                             _C=bow_search.best_params_['C'])
m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2',
                             _C=l2_search.best_params_['C'])
m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf',
                             _C=tfidf_search.best_params_['C'])

Test score with bow features: 0.7713079140780444
Test score with l2 features: 0.7691016634858759
Test score with tf-idf features: 0.7737848702643441

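Deep Dive

BoW vectors: document-term matrix ==> sparse matrix (most entries are 0)
Feature scaling methods ==> operate on the columns of the data matrix
Training a linear classifier amounts to finding the best linear combination of the features, which are the column vectors of the data matrix
- Column space: the space of vectors that can be produced as linear combinations of the column vectors of the data matrix (every b for which Ax = b has a solution)
- Null space: the space of vectors x satisfying Ax = 0; it can be thought of as 'novel' data points that cannot be formed as linear combinations of the existing data
The null space of the data matrix can be very large for two reasons ==> the row space and column space have low rank
- The matrix contains many similar data points (sparse matrix)
- There are fewer data points than features
Feature scaling can do little to fix the rank deficiency of the data matrix
Feature scaling can improve the condition number of the matrix, which lets linear models train faster
Analyzing the effect of feature engineering is difficult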
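
The point about rank can be checked on a small example. The sketch below, in plain Python with exact fractions, computes the rank of a toy count matrix before and after multiplying its columns by constants. The matrix, the scaling factors, and the helpers `matrix_rank` and `scale_columns` are hypothetical. The last case, where a factor is zero, is an added observation rather than something stated in this post: a word that appears in every document gets a tf-idf weight of log(1) = 0, which can only shrink the column space.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank

def scale_columns(rows, factors):
    """Multiply column j of the matrix by factors[j]."""
    return [[v * f for v, f in zip(row, factors)] for row in rows]

# Toy document-term counts: 3 documents, 4 words; row 3 = row 1 + row 2
counts = [
    [1, 0, 2, 0],
    [0, 1, 0, 1],
    [1, 1, 2, 1],
]

# Nonzero per-column constants, as in tf-idf-style scaling
scaled = scale_columns(counts, [Fraction(1, 2), 3, Fraction(1, 4), 2])
# Zero factors wipe out columns entirely
zeroed = scale_columns(counts, [1, 0, 1, 0])

print(matrix_rank(counts), matrix_rank(scaled), matrix_rank(zeroed))
```

Scaling by nonzero constants leaves the rank at 2, so it cannot repair rank deficiency, while zeroing columns reduces it further.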

Output of biz_df.head():
	address	attributes	business_id	categories	city	hours	is_open	latitude	longitude	name	postal_code	review_count	stars	state
0	2818 E Camino Acequia Drive	{'GoodForKids': 'False'}	1SWheh84yJXfytovILXOAQ	Golf, Active Life	Phoenix	None	0	33.522143	-112.018481	Arizona Biltmore Golf Club	85016	5	3.0	AZ
1	30 Eglinton Avenue W	{'RestaurantsReservations': 'True', 'GoodForMe...	QXAEGFB4oINsVuTFxEYKFQ	Specialty Food, Restaurants, Dim Sum, Imported...	Mississauga	{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...	1	43.605499	-79.652289	Emerald Chinese Restaurant	L5R 3E7	128	2.5	ON
2	10110 Johnston Rd, Ste 15	{'GoodForKids': 'True', 'NoiseLevel': 'u'avera...	gnKjwL_1w79qoiV3IC_xQQ	Sushi Bars, Restaurants, Japanese	Charlotte	{'Monday': '17:30-21:30', 'Wednesday': '17:30-...	1	35.092564	-80.859132	Musashi Japanese Restaurant	28210	170	4.0	NC
3	15655 W Roosevelt St, Ste 237	None	xvX2CttrVhyG2z1dFg_0xw	Insurance, Financial Services	Goodyear	{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...	1	33.455613	-112.395596	Farmers Insurance - Paul Lorenz	85338	3	5.0	AZ
4	4209 Stuart Andrew Blvd, Ste F	{'BusinessAcceptsBitcoin': 'False', 'ByAppoint...	HhyxOkGAM07SRYtlQ4wMFQ	Plumbing, Shopping, Local Services, Home Servi...	Charlotte	{'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ...	1	35.190012	-80.887223	Queen City Plumbing	28217	4	4.0	NC

Output of review_df.head():
	business_id	date	funny	review_id	stars	text	useful	user_id
0	ujmEBvifdJM6h6RLv4wQIg	2013-05-07 04:34:36	1	Q1sbwvVQXV2734tPgoKj4Q	1.0	Total bill for this horrible service? Over $8G...	6	hG7b0MtEbXx5QzbzE6C_VA
1	NZnhc2sEQy3RmzKTZnqtwQ	2017-01-14 21:30:33	0	GJXCdrto3ASJOqKeVWPi6Q	5.0	I adore Travis at the Hard Rock's new Kelly ...	0	yXQM5uF2jS6es16SJzNHfg
2	WTqjgwHlXbSFevF32_DJVw	2016-11-09 20:09:03	0	2TzJjDVDEuAW6MR5Vuc1ug	5.0	I have to say that this office really has it t...	3	n6-Gk65cPZL6Uz8qRm3NYw
3	ikCg8xy5JIg_NGPx-MSIDA	2018-01-09 20:56:38	0	yi0R0Ugj_xUx_Nek0-_Qig	5.0	Went in for a lunch. Steak sandwich was delici...	0	dacAIZ6fTM6mqwW5uxkskg
4	b1b1eb3uo-w561D0ZfCEiQ	2018-01-30 23:07:38	0	11a8sVPMUFtaC7_ABRkmtw	1.0	Today was my second out of three sessions I ha...	7	ssoyf2_x0EQMed6fgHeMyQ

Output of twobiz_reviews.head() after the merge:
	address	attributes	business_id	categories	city	hours	is_open	latitude	longitude	name	...	stars_x	state	cool	date	funny	review_id	stars_y	text	useful	user_id
0	30 Eglinton Avenue W	{'RestaurantsReservations': 'True', 'GoodForMe...	QXAEGFB4oINsVuTFxEYKFQ	Specialty Food, Restaurants, Dim Sum, Imported...	Mississauga	{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...	1	43.605499	-79.652289	Emerald Chinese Restaurant	...	2.5	ON	0	2017-01-27 21:54:30	2	6W0MQHmasK0IsaoDo4bmkw	3.0	My girlfriend and I went for dinner at Emerald...	3	2K62MJ4CJ19L8Tp5pRfjfQ
1	30 Eglinton Avenue W	{'RestaurantsReservations': 'True', 'GoodForMe...	QXAEGFB4oINsVuTFxEYKFQ	Specialty Food, Restaurants, Dim Sum, Imported...	Mississauga	{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...	1	43.605499	-79.652289	Emerald Chinese Restaurant	...	2.5	ON	0	2013-06-24 23:11:30	0	BeeBfUxvzD4qNX4HxrgA5g	3.0	We've always been there on a Sunday so we were...	0	A0kENtCCoVT3m7T35zb2Vg
2	30 Eglinton Avenue W	{'RestaurantsReservations': 'True', 'GoodForMe...	QXAEGFB4oINsVuTFxEYKFQ	Specialty Food, Restaurants, Dim Sum, Imported...	Mississauga	{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...	1	43.605499	-79.652289	Emerald Chinese Restaurant	...	2.5	ON	0	2016-01-04 12:59:22	0	A1D2kUnZ0HTroFreAheNSg	3.0	*No automatic doors, not baby friendly!* I...	0	SuOLY03LW5ZcnynKhbTydA
3	30 Eglinton Avenue W	{'RestaurantsReservations': 'True', 'GoodForMe...	QXAEGFB4oINsVuTFxEYKFQ	Specialty Food, Restaurants, Dim Sum, Imported...	Mississauga	{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...	1	43.605499	-79.652289	Emerald Chinese Restaurant	...	2.5	ON	0	2014-05-09 02:38:43	0	2pf45Stf-pNew-xgTababQ	1.0	Horrible service,\nI went there tonight with m...	1	lymyUak6KNcNKoDbK87MiQ
4	30 Eglinton Avenue W	{'RestaurantsReservations': 'True', 'GoodForMe...	QXAEGFB4oINsVuTFxEYKFQ	Specialty Food, Restaurants, Dim Sum, Imported...	Mississauga	{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...	1	43.605499	-79.652289	Emerald Chinese Restaurant	...	2.5	ON	2	2011-03-21 14:39:55	1	RHhlmL07evgAdPaXQV8Omg	4.0	One of the gauges of a good Chinese restaurant...	2	6vU0I5XgCv9OQHZ76rV6qw

데이터과학 삼학년

Ch4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf
