Correlation Between Categorical and Continuous Variables (categorical numerical correlation)

데이터과학 삼학년

Statistical Learning

Dan-k 2023. 9. 25. 09:00

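Correlation Between a Categorical Variable and a Continuous Variable

Group Comparison Plots (Box Plot or Violin Plot)

- The continuous variable can be drawn as a box plot or violin plot for each level of the categorical variable.

- These plots give a visual comparison of the distribution and median of the continuous variable at each level of the categorical variable.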

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Categorical variable
category = np.array(['A', 'B', 'A', 'B', 'A'])
# Continuous variable
continuous = np.array([10, 15, 12, 18, 8])

# Draw a box plot (swap in sns.violinplot for a violin plot)
sns.boxplot(x=category, y=continuous)
plt.show()
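
The numbers a box plot encodes can also be checked directly. The following is a minimal sketch, assuming the same toy arrays as above; the names `df_plot` and `summary` are made up for illustration and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Same toy data as the plotting example
category = np.array(['A', 'B', 'A', 'B', 'A'])
continuous = np.array([10, 15, 12, 18, 8])

# Hypothetical helper frame: one row per observation
df_plot = pd.DataFrame({'category': category, 'continuous': continuous})

# Per-level count, median, and mean of the continuous variable
summary = df_plot.groupby('category')['continuous'].agg(['count', 'median', 'mean'])
print(summary)
```

With so few observations the medians are only a rough guide, but the table makes it easy to confirm what the plot suggests.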

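Correlation Coefficient

- When the categorical variable is binary (it takes only two values), the relationship can be measured with the point-biserial correlation coefficient, or with a nonparametric coefficient such as Spearman's rank correlation.

- The point-biserial coefficient is the Pearson correlation computed with the binary variable coded as 0/1, whereas Spearman's coefficient is based on the ranks (ordering) of the two variables.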

import numpy as np
from scipy import stats

# Categorical variable (binary)
category = np.array([0, 1, 0, 1, 0])
# Continuous variable
continuous = np.array([10, 15, 12, 18, 8])

# Compute the Spearman correlation coefficient
correlation, p_value = stats.spearmanr(category, continuous)

print("Spearman Correlation:", correlation)
print("p-value:", p_value)
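
To make the distinction between the two coefficients concrete, here is a small sketch (not from the original post) showing that `stats.pointbiserialr` gives the same value as `stats.pearsonr` on a 0/1-coded variable, and that both match the textbook formula based on the two group means. The variable names `r_manual`, `m0`, `m1` and so on are illustrative.

```python
import numpy as np
from scipy import stats

category = np.array([0, 1, 0, 1, 0])
continuous = np.array([10, 15, 12, 18, 8])

# Point-biserial correlation and plain Pearson correlation on the same data
r_pb, p_pb = stats.pointbiserialr(category, continuous)
r_pearson, p_pearson = stats.pearsonr(category, continuous)

# Manual formula: (M1 - M0) / s_n * sqrt(n1 * n0 / n^2),
# where s_n is the population standard deviation (ddof=0)
m1 = continuous[category == 1].mean()
m0 = continuous[category == 0].mean()
n1 = int((category == 1).sum())
n0 = int((category == 0).sum())
n = len(continuous)
r_manual = (m1 - m0) / continuous.std(ddof=0) * np.sqrt(n1 * n0 / n**2)

print("point-biserial:", r_pb)
print("pearson:       ", r_pearson)
print("manual formula:", r_manual)
```

All three values agree up to floating-point error, which is why the point-biserial coefficient is usually described as a special case of Pearson's r rather than a rank-based measure.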

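from scipy import stats

# Assumes `df` is an existing DataFrame with a binary 'target' column
corr_list = {}
y = df['target'].astype(float)
for column in df:
    x = df[column].astype(float)
    # pointbiserialr expects the binary variable first
    corr = stats.pointbiserialr(y, x)
    corr_list[column] = corr
print(corr_list)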


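import pandas as pd
from scipy import stats

df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10], 'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
print(df)

corr_list = []
# 'A' is the binary variable
y = df['A'].astype(float)

# Note: the first entry is 'A' against itself
for column in df:
    x = df[column]
    corr = stats.pointbiserialr(y, x)
    # Keep only the correlation coefficient (index 0), not the p-value
    corr_list.append(corr[0])
print(corr_list)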

ANOVA (Analysis of Variance)

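- ANOVA is a statistical method for testing whether the mean of a continuous variable differs across the levels of a categorical variable.

- It compares the mean of the continuous variable at each level (category) of the categorical variable.

- ANOVA yields an F-statistic and a p-value; when the p-value is smaller than the significance level, the group means are judged to differ to a statistically significant degree, that is, the continuous variable is associated with the categorical variable.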

import numpy as np
from scipy import stats

# Categorical variable
category = np.array(['A', 'B', 'A', 'B', 'A'])
# Continuous variable
continuous = np.array([10, 15, 12, 18, 8])

# Run one-way ANOVA
f_statistic, p_value = stats.f_oneway(continuous[category == 'A'], continuous[category == 'B'])

print("F-statistic:", f_statistic)
print("p-value:", p_value)
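
The F-statistic says whether the group means differ, but not how strong the association is. A common companion measure, which the original post does not cover and is added here as an assumption, is eta squared: the share of total variance explained by the grouping. The sketch below computes it by hand from the sums of squares and checks the manual F against `stats.f_oneway`; names such as `ss_between` and `eta_squared` are illustrative.

```python
import numpy as np
from scipy import stats

category = np.array(['A', 'B', 'A', 'B', 'A'])
continuous = np.array([10, 15, 12, 18, 8], dtype=float)

# Split the continuous variable by level of the categorical variable
groups = [continuous[category == level] for level in np.unique(category)]

# Sums of squares around the grand mean
grand_mean = continuous.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((continuous - grand_mean) ** 2).sum()
ss_within = ss_total - ss_between

# F = (SS_between / (k - 1)) / (SS_within / (n - k))
k, n = len(groups), len(continuous)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

# Eta squared: proportion of variance explained by the grouping
eta_squared = ss_between / ss_total

f_scipy, p_value = stats.f_oneway(*groups)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
print("eta squared:", eta_squared)
```

With only two groups, eta squared equals the square of the point-biserial correlation, so the two approaches describe the same relationship.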

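Chi-Square Test

- The chi-square test is a method for testing the association between two categorical variables.

- To measure the relationship between a categorical variable and a continuous variable, other methods are generally used instead.

- It can still be applied by binning the continuous variable into discrete intervals and then comparing how the observations fall into those intervals across the levels of the categorical variable.

- A chi-square test, or another suitable nonparametric test, is then run on the binned data.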

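import numpy as np
import pandas as pd
from scipy import stats

# Categorical variable
category = np.array(['A', 'B', 'A', 'B', 'A'])
# Continuous variable, treated here as already binned into two intervals (0 / 1)
continuous = np.array([1, 0, 1, 1, 0])

# Run the chi-square test on the contingency table
chi2_stat, p_value, dof, expected = stats.chi2_contingency(pd.crosstab(category, continuous))

print("Chi-Square Statistic:", chi2_stat)
print("p-value:", p_value)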
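
The binning step itself is not shown above. One way to do it, sketched here under assumptions, is `pd.qcut`, which splits a continuous variable into equal-sized quantile bins. The synthetic data, the choice of three bins, and the labels `low`/`mid`/`high` are all illustrative and not from the original post.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data (assumption): two groups of 60 with different means
rng = np.random.default_rng(0)
category = np.repeat(['A', 'B'], 60)
continuous = np.concatenate([rng.normal(10, 2, 60), rng.normal(13, 2, 60)])

# Bin the continuous variable into three quantile-based intervals
binned = pd.qcut(continuous, q=3, labels=['low', 'mid', 'high'])

# Contingency table: category levels (rows) by bins (columns)
table = pd.crosstab(pd.Series(category, name='category'),
                    pd.Series(binned, name='bin'))
print(table)

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
print("Chi-Square Statistic:", chi2_stat)
print("p-value:", p_value)
```

Binning discards information and the result depends on where the cut points fall, so this is better treated as a rough check than as a replacement for ANOVA or Kruskal-Wallis.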

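Kruskal-Wallis Test

- Like ANOVA, the Kruskal-Wallis test checks whether a continuous variable differs across the levels of a categorical variable, but it is a nonparametric method that compares the ranks of the observations rather than their means.

- It can be used in place of ANOVA when the data do not follow a normal distribution.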

import numpy as np
from scipy import stats

# Categorical variable
category = np.array(['A', 'B', 'A', 'B', 'A'])
# Continuous variable
continuous = np.array([10, 15, 12, 18, 8])

# Run the Kruskal-Wallis test
h_statistic, p_value = stats.kruskal(continuous[category == 'A'], continuous[category == 'B'])

print("Kruskal-Wallis Statistic:", h_statistic)
print("p-value:", p_value)
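
One possible way to choose between ANOVA and Kruskal-Wallis is to check each group for normality first. The sketch below uses the Shapiro-Wilk test (`stats.shapiro`) for that purpose. This decision rule is a common heuristic added here as an assumption, not something the original post prescribes; the synthetic groups and the 0.05 threshold are likewise illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): one roughly normal group, one skewed group
rng = np.random.default_rng(0)
group_a = rng.normal(10, 2, 40)
group_b = rng.exponential(3, 40) + 8

alpha = 0.05

# Shapiro-Wilk: a small p-value suggests the group is not normally distributed
normal_ok = all(stats.shapiro(g).pvalue > alpha for g in (group_a, group_b))

if normal_ok:
    test_name = "ANOVA"
    statistic, p_value = stats.f_oneway(group_a, group_b)
else:
    test_name = "Kruskal-Wallis"
    statistic, p_value = stats.kruskal(group_a, group_b)

print("Test used:", test_name)
print("statistic:", statistic)
print("p-value:", p_value)
```

Normality tests have little power on small samples, so a visual check such as the box plot is still worth doing alongside this.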
