
Ch4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf

Dan-k 2020. 4. 2. 16:11

Limitations of BoW (Bag-of-Words)

  • Emma knocked on the door. No answer. She knocked again and waited. There was a large maple tree next to the house. Emma looked up the tree and saw a giant raven perched at the treetop. Under the afternoon sun, the raven gleamed magnificently. Its beak was hard and pointed, its claws sharp and strong. It looked regal and imposing. It reigned the tree it stood on. The raven was looking straight at Emma with its beady black eye, Emma felt slightly intimidated. She took a step back from the door and tentatively said, "Hello?"
  • Even important words such as 'magnificently', 'gleamed', 'intimidated', 'tentatively', and 'reigned' are hard to pick out with BoW: raw counts are dominated by frequent but uninformative words, as the output below shows.
In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
In [41]:
sentence = 'Emma knocked on the door. No answer. She knocked again and waited. There was a large maple tree next to the house. Emma looked up the tree and saw a giant raven perched at the treetop. Under the afternoon sun, the raven gleamed magnificently. Its beak was hard and pointed, its claws sharp and strong. It looked regal and imposing. It reigned the tree it stood on. The raven was looking straight at Emma with its beady black eye, Emma felt slightly intimidated. She took a step back from the door and tentatively said, "Hello?"'
bow_converter = CountVectorizer()
X = bow_converter.fit_transform([sentence])
features = bow_converter.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
df = pd.DataFrame()
df['features'] = features
df['cnt'] = X.toarray()[0]
df[df['cnt'] > 2]  # show only the words that appear more than twice
Out[41]:
   features  cnt
2  and         6
11 emma        4
22 it          3
23 its         3
35 raven       3
49 the         9
53 tree        3
58 was         3
 

tf-idf: A Twist on BoW

  • Emphasizes meaningful words
  • Term frequency - inverse document frequency
  • bow(w, d) = number of times word w appears in document d
  • tf-idf(w, d) = bow(w, d) × log(N / number of documents containing w), where N is the total number of documents
    • Favors words that appear often in one document but rarely across documents (rare words stand out, common words are suppressed); a minimal worked example follows
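
As a quick illustration of the formula above, here is a minimal sketch that computes tf-idf by hand on a hypothetical three-document corpus (the documents and the tf_idf helper are made up for illustration):

import numpy as np

# Toy corpus: three tiny 'documents' (hypothetical)
docs = [['emma', 'knocked', 'on', 'the', 'door'],
        ['the', 'raven', 'gleamed'],
        ['the', 'raven', 'and', 'the', 'tree']]
N = len(docs)  # total number of documents

def tf_idf(word, doc):
    bow = doc.count(word)                  # bow(w, d)
    df = sum(word in d for d in docs)      # number of documents containing w
    return bow * np.log(N / df)

print(tf_idf('the', docs[2]))    # 2 x log(3/3) = 0.0  -> common word is wiped out
print(tf_idf('raven', docs[2]))  # 1 x log(3/2) ~ 0.41 -> rarer word stands out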
 

Testing tf-idf

  • tf-idf transforms word-count features by multiplying each count by a constant (the word's idf) => an example of feature scaling
  • Compare the performance of scaled versus unscaled features on a text classification task
In [42]:
import json
import pandas as pd

# Load the Yelp business data
biz_f = open('yelp_academic_dataset_business.json', encoding='UTF8')
biz_df = pd.DataFrame([json.loads(x) for x in biz_f.readlines()])
biz_f.close()
In [44]:
# Load the Yelp review data
review_file = open('review.json', encoding='UTF8')
review_df = pd.DataFrame([json.loads(x) for x in review_file.readlines()])
review_file.close()
In [46]:
biz_df.head()
Out[46]:
  address attributes business_id categories city hours is_open latitude longitude name postal_code review_count stars state
0 2818 E Camino Acequia Drive {'GoodForKids': 'False'} 1SWheh84yJXfytovILXOAQ Golf, Active Life Phoenix None 0 33.522143 -112.018481 Arizona Biltmore Golf Club 85016 5 3.0 AZ
1 30 Eglinton Avenue W {'RestaurantsReservations': 'True', 'GoodForMe... QXAEGFB4oINsVuTFxEYKFQ Specialty Food, Restaurants, Dim Sum, Imported... Mississauga {'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W... 1 43.605499 -79.652289 Emerald Chinese Restaurant L5R 3E7 128 2.5 ON
2 10110 Johnston Rd, Ste 15 {'GoodForKids': 'True', 'NoiseLevel': 'u'avera... gnKjwL_1w79qoiV3IC_xQQ Sushi Bars, Restaurants, Japanese Charlotte {'Monday': '17:30-21:30', 'Wednesday': '17:30-... 1 35.092564 -80.859132 Musashi Japanese Restaurant 28210 170 4.0 NC
3 15655 W Roosevelt St, Ste 237 None xvX2CttrVhyG2z1dFg_0xw Insurance, Financial Services Goodyear {'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ... 1 33.455613 -112.395596 Farmers Insurance - Paul Lorenz 85338 3 5.0 AZ
4 4209 Stuart Andrew Blvd, Ste F {'BusinessAcceptsBitcoin': 'False', 'ByAppoint... HhyxOkGAM07SRYtlQ4wMFQ Plumbing, Shopping, Local Services, Home Servi... Charlotte {'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ... 1 35.190012 -80.887223 Queen City Plumbing 28217 4 4.0 NC
In [47]:
review_df.head()
Out[47]:
  business_id cool date funny review_id stars text useful user_id
0 ujmEBvifdJM6h6RLv4wQIg 0 2013-05-07 04:34:36 1 Q1sbwvVQXV2734tPgoKj4Q 1.0 Total bill for this horrible service? Over $8G... 6 hG7b0MtEbXx5QzbzE6C_VA
1 NZnhc2sEQy3RmzKTZnqtwQ 0 2017-01-14 21:30:33 0 GJXCdrto3ASJOqKeVWPi6Q 5.0 I *adore* Travis at the Hard Rock's new Kelly ... 0 yXQM5uF2jS6es16SJzNHfg
2 WTqjgwHlXbSFevF32_DJVw 0 2016-11-09 20:09:03 0 2TzJjDVDEuAW6MR5Vuc1ug 5.0 I have to say that this office really has it t... 3 n6-Gk65cPZL6Uz8qRm3NYw
3 ikCg8xy5JIg_NGPx-MSIDA 0 2018-01-09 20:56:38 0 yi0R0Ugj_xUx_Nek0-_Qig 5.0 Went in for a lunch. Steak sandwich was delici... 0 dacAIZ6fTM6mqwW5uxkskg
4 b1b1eb3uo-w561D0ZfCEiQ 0 2018-01-30 23:07:38 0 11a8sVPMUFtaC7_ABRkmtw 1.0 Today was my second out of three sessions I ha... 7 ssoyf2_x0EQMed6fgHeMyQ
In [60]:
# Keep only businesses in the Nightlife or Restaurants categories;
# first drop rows where categories is None
biz_df = biz_df.dropna(subset=['categories'])

two_biz = biz_df[biz_df.apply(lambda x: 'Nightlife' in x['categories'] or 'Restaurants' in x['categories'], axis=1)]
In [62]:
# Merge the business data and the review data into a single table
twobiz_reviews = two_biz.merge(review_df, on='business_id', how='inner')
twobiz_reviews.head()
Out[62]:
  address attributes business_id categories city hours is_open latitude longitude name ... stars_x state cool date funny review_id stars_y text useful user_id
0 30 Eglinton Avenue W {'RestaurantsReservations': 'True', 'GoodForMe... QXAEGFB4oINsVuTFxEYKFQ Specialty Food, Restaurants, Dim Sum, Imported... Mississauga {'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W... 1 43.605499 -79.652289 Emerald Chinese Restaurant ... 2.5 ON 0 2017-01-27 21:54:30 2 6W0MQHmasK0IsaoDo4bmkw 3.0 My girlfriend and I went for dinner at Emerald... 3 2K62MJ4CJ19L8Tp5pRfjfQ
1 30 Eglinton Avenue W {'RestaurantsReservations': 'True', 'GoodForMe... QXAEGFB4oINsVuTFxEYKFQ Specialty Food, Restaurants, Dim Sum, Imported... Mississauga {'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W... 1 43.605499 -79.652289 Emerald Chinese Restaurant ... 2.5 ON 0 2013-06-24 23:11:30 0 BeeBfUxvzD4qNX4HxrgA5g 3.0 We've always been there on a Sunday so we were... 0 A0kENtCCoVT3m7T35zb2Vg
2 30 Eglinton Avenue W {'RestaurantsReservations': 'True', 'GoodForMe... QXAEGFB4oINsVuTFxEYKFQ Specialty Food, Restaurants, Dim Sum, Imported... Mississauga {'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W... 1 43.605499 -79.652289 Emerald Chinese Restaurant ... 2.5 ON 0 2016-01-04 12:59:22 0 A1D2kUnZ0HTroFreAheNSg 3.0 ***No automatic doors, not baby friendly!*** I... 0 SuOLY03LW5ZcnynKhbTydA
3 30 Eglinton Avenue W {'RestaurantsReservations': 'True', 'GoodForMe... QXAEGFB4oINsVuTFxEYKFQ Specialty Food, Restaurants, Dim Sum, Imported... Mississauga {'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W... 1 43.605499 -79.652289 Emerald Chinese Restaurant ... 2.5 ON 0 2014-05-09 02:38:43 0 2pf45Stf-pNew-xgTababQ 1.0 Horrible service,\nI went there tonight with m... 1 lymyUak6KNcNKoDbK87MiQ
4 30 Eglinton Avenue W {'RestaurantsReservations': 'True', 'GoodForMe... QXAEGFB4oINsVuTFxEYKFQ Specialty Food, Restaurants, Dim Sum, Imported... Mississauga {'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W... 1 43.605499 -79.652289 Emerald Chinese Restaurant ... 2.5 ON 2 2011-03-21 14:39:55 1 RHhlmL07evgAdPaXQV8Omg 4.0 One of the gauges of a good Chinese restaurant... 2 6vU0I5XgCv9OQHZ76rV6qw

5 rows × 22 columns

In [75]:
# Drop the features we will not use
twobiz_reviews = twobiz_reviews[['business_id',
                                 'name',
                                 'stars_y',
                                 'text',
                                 'categories']]

# Create a target column: True if the business is Nightlife, False otherwise
twobiz_reviews['target'] = twobiz_reviews.apply(lambda x: 'Nightlife' in x['categories'], axis=1)
 

Classifying Reviews as Restaurants or Nightlife from the Review Text

  • Imbalanced dataset (the review counts of the two categories differ greatly) => the model spends most of its effort fitting the larger class
  • Here, the larger class is down-sampled to roughly the size of the smaller class
In [76]:
print('Number of restaurant reviews: ' + str(len(twobiz_reviews[twobiz_reviews['target']==False])))
print('Number of nightlife reviews: ' + str(len(twobiz_reviews[twobiz_reviews['target']==True])))
 
Number of restaurant reviews: 3192186
Number of nightlife reviews: 1202166
In [77]:
# Build a class-balanced sample to use for modeling
nightlife = twobiz_reviews[twobiz_reviews.apply(lambda x: 'Nightlife' in x['categories'], axis=1)]
restaurants = twobiz_reviews[twobiz_reviews.apply(lambda x: 'Restaurants' in x['categories'], axis=1)]
In [81]:
nightlife_subset = nightlife.sample(frac=0.1, random_state=123)
restaurant_subset = restaurants.sample(frac=0.03, random_state=123)
combined = pd.concat([nightlife_subset, restaurant_subset])
In [88]:
# Split into training and test sets
from sklearn.model_selection import train_test_split
training_data, test_data = train_test_split(combined, train_size=0.7, test_size=0.3, random_state=123)

print(training_data.shape)
print(test_data.shape)
 
(172387, 6)
(73881, 6)
 

BoW, tf-idf, and l2 Normalization for Linear Classification

  • Scaling parameters fit on the training set do not map cleanly onto the test set
  • Some data is lost (words that appear only in the test set)
    • Usually such new words are simply dropped; the sketch below shows this default behavior
    • Alternatively, create an explicit 'garbage' token, learn it explicitly from the training set, and map the unseen test-set words to it
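
A minimal sketch of the default behavior (on made-up strings): CountVectorizer silently drops words at transform time that it never saw during fit.

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(['the raven gleamed', 'emma knocked'])        # hypothetical training texts
X_new = vec.transform(['the raven croaked loudly'])   # 'croaked', 'loudly' are unseen
print(sorted(vec.vocabulary_))   # ['emma', 'gleamed', 'knocked', 'raven', 'the']
print(X_new.toarray())           # [[0 0 0 1 1]] -- only 'raven' and 'the' are counted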
In [95]:
# Transform the review text into a BoW representation
from sklearn.feature_extraction import text

bow_transform = text.CountVectorizer()
X_tr_bow = bow_transform.fit_transform(training_data['text'])
X_te_bow = bow_transform.transform(test_data['text'])
len(bow_transform.vocabulary_)
Out[95]:
95654
In [92]:
y_tr = training_data['target']
y_te = test_data['target']
In [93]:
# Build the tf-idf representation from the BoW matrices
tfidf_trfm = text.TfidfTransformer(norm=None)
X_tr_tfidf = tfidf_trfm.fit_transform(X_tr_bow)
X_te_tfidf = tfidf_trfm.transform(X_te_bow)
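
Because norm=None, this tf-idf transform is just a column-wise rescaling of the BoW counts by the learned idf_ vector, i.e. tf-idf(w, d) = bow(w, d) × idf(w). A quick sanity check (a sketch, assuming the matrices above are still in memory):

import numpy as np

check = X_tr_bow.multiply(tfidf_trfm.idf_).tocsr()  # scale every column by its idf
print(np.allclose(check[:5].toarray(), X_tr_tfidf[:5].toarray()))  # expect True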
In [98]:
# For comparison, l2-normalize the BoW representation
# (axis=0 scales each feature column to unit norm)
import sklearn.preprocessing as preproc
X_tr_l2 = preproc.normalize(X_tr_bow, axis=0)
X_te_l2 = preproc.normalize(X_te_bow, axis=0)
 

Logistic Regression

  • Sigmoid function: 1 / (1 + e^(-(w^T x + b))), output range (0, 1); a minimal sketch follows below
  • Output < 0.5 => False, output > 0.5 => True
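
A minimal sketch of this decision rule, with made-up weights w and bias b:

import numpy as np

def sigmoid(z):
    # 1 / (1 + e^-z): squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.8, -0.4]), 0.1   # hypothetical learned parameters
x = np.array([1.0, 2.0])            # one feature vector
p = sigmoid(w @ x + b)              # P(y = True | x)
print(p, p > 0.5)                   # ~0.52 -> predicted class True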
In [110]:
from sklearn.linear_model import LogisticRegression

def simple_logistic_classify(X_tr, y_tr, X_test, y_test, description, _C=1.0):
    # Fit a logistic regression with inverse regularization strength C,
    # then report accuracy on the held-out test set
    m = LogisticRegression(C=_C).fit(X_tr, y_tr)
    s = m.score(X_test, y_test)
    print('Test score with', description, 'features:', s)
    return m

m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow')
m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized')
m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf')
 
Test score with bow features: 0.7674909652007958
Test score with l2-normalized features: 0.7361567926801207
Test score with tf-idf features: 0.7424100919045492
 

Tuning Logistic Regression with Regularization

  • When there are more features than data points, the problem of finding the best model is underdetermined, which invites overfitting => addressed with regularization
  • Hyperparameter tuning -> grid search -> in deep learning, random search is more common
  • K-fold cross-validation
  • GridSearchCV -> runs grid search together with K-fold cross-validation
In [102]:
# Tune logistic regression hyperparameters via grid search
import sklearn.model_selection as modsel

# Specify the grid to search; run a 5-fold grid search for each feature set
param_grid_ = {'C': [1e-5, 1e-3, 1e-1, 1e0, 1e1, 1e2]}

# Tune the classifier for the BoW representation
bow_search = modsel.GridSearchCV(LogisticRegression(), cv=5,
                                 param_grid=param_grid_)
bow_search.fit(X_tr_bow, y_tr)
Out[102]:
GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
In [103]:
# Tune the classifier for the l2-normalized word vectors
l2_search = modsel.GridSearchCV(LogisticRegression(), cv=5,
                               param_grid=param_grid_)
l2_search.fit(X_tr_l2, y_tr)
Out[103]:
GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
In [104]:
# Tune the classifier for the tf-idf representation
tfidf_search = modsel.GridSearchCV(LogisticRegression(), cv=5,
                                  param_grid=param_grid_)
tfidf_search.fit(X_tr_tfidf, y_tr)
Out[104]:
GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
In [105]:
# Print the BoW grid-search results
bow_search.cv_results_
 
Out[105]:
{'mean_fit_time': array([  1.41952262,   4.24020009,  34.77255497, 111.57552967,
        218.28999233, 191.94038186]),
 'std_fit_time': array([1.53911039e-02, 1.55586522e-01, 1.33868671e+00, 9.82034884e+00,
        3.98804937e+01, 1.92471981e+01]),
 'mean_score_time': array([0.01456137, 0.01495895, 0.01506171, 0.01695457, 0.01576467,
        0.01476154]),
 'std_score_time': array([0.00101827, 0.00109471, 0.0011967 , 0.00109267, 0.00115247,
        0.001323  ]),
 'param_C': masked_array(data=[1e-05, 0.001, 0.1, 1.0, 10.0, 100.0],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 1e-05},
  {'C': 0.001},
  {'C': 0.1},
  {'C': 1.0},
  {'C': 10.0},
  {'C': 100.0}],
 'split0_test_score': array([0.61526191, 0.7456349 , 0.76732989, 0.76135507, 0.75076861,
        0.74702709]),
 'split1_test_score': array([0.61555195, 0.7449098 , 0.76947619, 0.76138407, 0.74676605,
        0.74862231]),
 'split2_test_score': array([0.61536677, 0.74440932, 0.76714911, 0.75931781, 0.74472837,
        0.74029063]),
 'split3_test_score': array([0.61501871, 0.74806393, 0.77152885, 0.76604693, 0.75293674,
        0.75157351]),
 'split4_test_score': array([0.61548279, 0.74577254, 0.76816428, 0.76062302, 0.74983322,
        0.7467297 ]),
 'mean_test_score': array([0.61533642, 0.74575809, 0.76872966, 0.76174538, 0.7490066 ,
        0.74684866]),
 'std_test_score': array([0.00018724, 0.00125467, 0.00162295, 0.00227766, 0.00291743,
        0.00370198]),
 'rank_test_score': array([6, 5, 1, 2, 3, 4]),
 'split0_train_score': array([0.61508676, 0.75125626, 0.81600186, 0.86204671, 0.89427086,
        0.89890435]),
 'split1_train_score': array([0.61608742, 0.75149555, 0.8147329 , 0.86211922, 0.89750488,
        0.89242907]),
 'split2_train_score': array([0.61524908, 0.75194692, 0.81548836, 0.86161265, 0.90075412,
        0.90417664]),
 'split3_train_score': array([0.61519107, 0.75136683, 0.81409615, 0.85995214, 0.89242259,
        0.891828  ]),
 'split4_train_score': array([0.61515481, 0.75185266, 0.81578566, 0.86180117, 0.89430788,
        0.89792618]),
 'mean_train_score': array([0.61535383, 0.75158365, 0.81522099, 0.86150638, 0.89585207,
        0.89705285]),
 'std_train_score': array([0.00037055, 0.00027066, 0.00070761, 0.00079767, 0.00294645,
        0.00455231])}
In [106]:
# Visualize and compare classifier performance by box-plotting the
# cross-validated mean accuracy of each feature set over the C grid
search_results = pd.DataFrame.from_dict({
    'bow': bow_search.cv_results_['mean_test_score'],
    'tfidf': tfidf_search.cv_results_['mean_test_score'],
    'l2': l2_search.cv_results_['mean_test_score']
})
In [108]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
ax = sns.boxplot(data=search_results, width=0.4)
ax.set_ylabel('Accuracy', size=14)
ax.tick_params(labelsize=14)
 
In [111]:
# Train the final models with the best hyperparameter found for each feature set
m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow',
                             _C=bow_search.best_params_['C'])
m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2',
                             _C=l2_search.best_params_['C'])
m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf',
                             _C=tfidf_search.best_params_['C'])
 
Test score with bow features: 0.7713079140780444
Test score with l2 features: 0.7691016634858759
Test score with tf-idf features: 0.7737848702643441
 

Deep Dive

  • BoW vectors: a document-term matrix ==> a sparse matrix (most entries are 0)
  • Feature scaling techniques ==> operate on the columns of the data matrix
  • Training a linear classifier means finding the best linear combination of the features, i.e., the column vectors of the data matrix
    • Column space: the space of vectors that can be produced as linear combinations of the data matrix's column vectors (Ax = b)
    • Null space: the space of vectors satisfying Ax = 0; 'novel' data points that cannot be formed as linear combinations of the existing data
  • The null space of a data matrix can be very large for two reasons ==> the rank of its row and column spaces is small:
    • when the data contains many similar data points (a sparse matrix)
    • when there are fewer data points than features
  • Feature scaling cannot fix the rank deficiency of a data matrix
  • Feature scaling can improve the condition number of the matrix, letting linear models train faster (see the sketch after this list)
  • Analyzing the effects of feature engineering remains difficult
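
A minimal sketch of the condition-number point, using a made-up, badly scaled design matrix:

import numpy as np

# Hypothetical design matrix: the second column is ~1000x the scale of the first
A = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])
print(np.linalg.cond(A))          # ~1500: ill-conditioned, slow/unstable optimization

# Scale each column to unit l2 norm, as preproc.normalize(..., axis=0) does above
A_scaled = A / np.linalg.norm(A, axis=0)
print(np.linalg.cond(A_scaled))   # ~3: much better conditioned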