How the CatBoost encoder works (CatBoostEncoder in categorical-encodings)

데이터과학 삼학년

Dan-k 2023. 6. 26. 13:00

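CatBoost encodes categorical variables automatically, with no separate preprocessing, as long as you tell it which columns are categorical (for example, by passing their indices).

So which encoding method does CatBoost use for categorical variables?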

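Key terms

* TargetSum: the sum of the target values over the rows whose categorical feature takes the value being encoded.

- For example, when encoding "red", it is the sum of the targets of all "red" rows.

* Prior: (sum of target values in the whole dataset) / (total number of observations, i.e. rows, in the dataset)

- Sum of the target (y) over the whole dataset / number of rows → the mean of the target.

* FeatureCount: the number of rows whose categorical feature has the same value as the one being encoded.

- When training rows are encoded in order, only the rows observed up to the current one are counted (and summed into TargetSum); the example below uses the whole dataset for simplicity.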

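Formula

Encoding Value = (TargetSum + Prior) / (FeatureCount + 1)

Prior = (sum of all target values) / (number of rows) = mean of the target

>> Conceptually, the encoded value is the mean of the target values matched to the category being encoded, pulled slightly toward the overall target mean by one extra "prior" observation.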
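The formula above can be written as a small helper. This is a minimal sketch, not the library's code: the function name `catboost_encode_value` and the weight `a` are assumptions introduced here for illustration, and with `a = 1` the expression is exactly the formula above. The input numbers are hypothetical.

```python
def catboost_encode_value(target_sum, feature_count, prior, a=1.0):
    """Smoothed target mean: (TargetSum + a * Prior) / (FeatureCount + a).

    `a` is a hypothetical smoothing weight; a = 1 gives the formula in the text.
    """
    return (target_sum + a * prior) / (feature_count + a)


# Hypothetical category: two rows whose targets sum to 7, overall target mean 2.0.
value = catboost_encode_value(target_sum=7, feature_count=2, prior=2.0)
print(value)  # 3.0

# A category with no observed rows falls back to the prior itself.
unseen = catboost_encode_value(target_sum=0, feature_count=0, prior=2.0)
print(unseen)  # 2.0
```

The second call shows why the prior is added: a category with no observations still gets a sensible value, and rare categories are pulled toward the overall mean.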

Example

color = ["red", "blue", "blue", "green", "red", "red", "black", "black", "blue", "green"]

target = [1, 2, 3, 2, 3, 1, 4, 4, 2, 3]

The prior is the mean of the target values: 25/10 = 2.5

To encode the "red" category:

TargetSum = 1+3+1 = 5 (the sum of the target values matched to "red")

FeatureCount = 3 (the number of "red" rows in the whole dataset)

Final encoded value for "red" = (5+2.5) / (3+1) = 1.875
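The same arithmetic can be repeated for every category with a few lines of plain Python. This is a sketch of the whole-dataset calculation from the example above; the helper name `encode_categories` is introduced here for illustration only.

```python
from collections import defaultdict

color = ["red", "blue", "blue", "green", "red",
         "red", "black", "black", "blue", "green"]
target = [1, 2, 3, 2, 3, 1, 4, 4, 2, 3]


def encode_categories(categories, targets):
    """Return {category: (TargetSum + Prior) / (FeatureCount + 1)}."""
    prior = sum(targets) / len(targets)  # mean of the target, 2.5 here
    target_sum = defaultdict(float)      # per-category sum of targets
    feature_count = defaultdict(int)     # per-category number of rows
    for category, value in zip(categories, targets):
        target_sum[category] += value
        feature_count[category] += 1
    return {
        category: (target_sum[category] + prior) / (feature_count[category] + 1)
        for category in feature_count
    }


encoded = encode_categories(color, target)
for category, value in encoded.items():
    print(category, value)
# red 1.875
# blue 2.375
# green 2.5
# black 3.5
```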

Code

# import libraries
import category_encoders as ce
import pandas as pd
  
# Make dataset
train = pd.DataFrame({
    'color': ["red", "blue", "blue", "green", "red",
              "red", "black", "black", "blue", "green"],
    
    'interests': ["sketching", "painting", "instruments",
                  "sketching", "painting", "video games",
                  "painting", "instruments", "sketching",
                  "sketching"],
    
    'height': [68, 64, 87, 45, 54, 64, 67, 98, 90, 87],
    
    'grade': [1, 2, 3, 2, 3, 1, 4, 4, 2, 3], })
  
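# Split the features from the target column
target = train['grade']
train = train.drop('grade', axis=1)

# Define the CatBoost encoder (all string columns are encoded by default)
cbe_encoder = ce.CatBoostEncoder()

# Fit the encoder, then transform the features.
# Without the target, transform() uses statistics from the whole fitted dataset.
cbe_encoder.fit(train, target)
train_cbe = cbe_encoder.transform(train)

# fit_transform(train, target) also passes the target to the transform step,
# so the training rows may receive ordered (row-by-row) values that differ
# from the result above:
# train_cbe = cbe_encoder.fit_transform(train, target)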
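The FeatureCount definition speaks of values "observed up to the current one". The sketch below shows what that ordered variant looks like in plain Python: each row is encoded using only the rows that came before it, so a row never sees its own target. It is an illustration under assumptions, not the library's implementation: the helper name `ordered_encode` and the weight `a` are introduced here, and, as far as I understand, CatBoost itself additionally works on random permutations of the rows. To my understanding, category_encoders applies a computation of this kind when the target is supplied at transform time, so the values for training rows can differ from the whole-dataset values.

```python
from collections import defaultdict

color = ["red", "blue", "blue", "green", "red",
         "red", "black", "black", "blue", "green"]
target = [1, 2, 3, 2, 3, 1, 4, 4, 2, 3]


def ordered_encode(categories, targets, a=1.0):
    """Encode each row from the rows before it (ordered target statistics)."""
    prior = sum(targets) / len(targets)  # overall target mean
    running_sum = defaultdict(float)     # TargetSum over earlier rows
    running_count = defaultdict(int)     # FeatureCount over earlier rows
    encoded = []
    for category, value in zip(categories, targets):
        # Use statistics accumulated so far, before adding the current row.
        encoded.append(
            (running_sum[category] + a * prior) / (running_count[category] + a)
        )
        running_sum[category] += value
        running_count[category] += 1
    return encoded


ordered = ordered_encode(color, target)
for category, value in zip(color, ordered):
    print(category, round(value, 4))
```

The first "red" row gets the prior 2.5 because no earlier "red" row exists, and the second gets (1 + 2.5) / (1 + 1) = 1.75, whereas the whole-dataset formula assigns 1.875 to every "red" row.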

Reference

https://www.geeksforgeeks.org/categorical-encoding-with-catboost-encoder/
