[Python] 데이터 사이언스 스쿨 - 4.4 범주형 독립변수
Updated:
데이터 사이언스 스쿨 자료를 토대로 공부한 내용입니다.
실습과정에서 필요에 따라 내용의 누락 및 추가, 수정사항이 있습니다.
기본 세팅
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
mpl.rc('font', family='NanumGothic') # 폰트 설정
mpl.rc('axes', unicode_minus=False) # 유니코드에서 음수 부호 설정
# 차트 스타일 설정
sns.set(font="NanumGothic", rc={"axes.unicode_minus":False}, style='darkgrid')
plt.rc("figure", figsize=(10,8))
warnings.filterwarnings("ignore")
4.4 범주형 독립변수
4.4.1 축소랭크(Reduce-Rank)
from sklearn.datasets import load_boston
import statsmodels.api as sm
boston = load_boston()
dfX = pd.DataFrame(boston.data, columns=boston.feature_names)
dfy = pd.DataFrame(boston.target, columns=["MEDV"])
df_boston = pd.concat([dfX, dfy], axis=1)
model1 = sm.OLS.from_formula("MEDV ~ " + "+".join(boston.feature_names), data=df_boston)
result1 = model1.fit()
print(result1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: MEDV R-squared: 0.741
Model: OLS Adj. R-squared: 0.734
Method: Least Squares F-statistic: 108.1
Date: Fri, 11 Jun 2021 Prob (F-statistic): 6.72e-135
Time: 19:19:43 Log-Likelihood: -1498.8
No. Observations: 506 AIC: 3026.
Df Residuals: 492 BIC: 3085.
Df Model: 13
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 36.4595 5.103 7.144 0.000 26.432 46.487
CRIM -0.1080 0.033 -3.287 0.001 -0.173 -0.043
ZN 0.0464 0.014 3.382 0.001 0.019 0.073
INDUS 0.0206 0.061 0.334 0.738 -0.100 0.141
CHAS 2.6867 0.862 3.118 0.002 0.994 4.380
NOX -17.7666 3.820 -4.651 0.000 -25.272 -10.262
RM 3.8099 0.418 9.116 0.000 2.989 4.631
AGE 0.0007 0.013 0.052 0.958 -0.025 0.027
DIS -1.4756 0.199 -7.398 0.000 -1.867 -1.084
RAD 0.3060 0.066 4.613 0.000 0.176 0.436
TAX -0.0123 0.004 -3.280 0.001 -0.020 -0.005
PTRATIO -0.9527 0.131 -7.283 0.000 -1.210 -0.696
B 0.0093 0.003 3.467 0.001 0.004 0.015
LSTAT -0.5248 0.051 -10.347 0.000 -0.624 -0.425
==============================================================================
Omnibus: 178.041 Durbin-Watson: 1.078
Prob(Omnibus): 0.000 Jarque-Bera (JB): 783.126
Skew: 1.521 Prob(JB): 8.84e-171
Kurtosis: 8.281 Cond. No. 1.51e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
-
0과 1로 구성된 범주형 변수 CHAS는 축소랭크방식으로 회귀 계수가 구해졌다.
-
CHAS가 0인 경우 Intercept는 36.4595, CHAS가 1인 경우 Intercept는 39.1462가 된다.
-
CHAS의 회귀계수 2.6867은 0을 기준으로 1일 때 절편이 얼마나 커지는지를 의미한다.
두 개 이상의 범주형 변수가 있는 경우
- 축소랭크 방식을 이용한다.
4.4.2 풀랭크(Full-Rank)
feature_names = list(boston.feature_names)
feature_names.remove("CHAS")
feature_names = [name for name in feature_names] + ["C(CHAS)"]
model2 = sm.OLS.from_formula("MEDV ~ 0 + " + "+".join(feature_names), data=df_boston)
result2 = model2.fit()
print(result2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: MEDV R-squared: 0.741
Model: OLS Adj. R-squared: 0.734
Method: Least Squares F-statistic: 108.1
Date: Fri, 11 Jun 2021 Prob (F-statistic): 6.72e-135
Time: 19:19:43 Log-Likelihood: -1498.8
No. Observations: 506 AIC: 3026.
Df Residuals: 492 BIC: 3085.
Df Model: 13
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
C(CHAS)[0.0] 36.4595 5.103 7.144 0.000 26.432 46.487
C(CHAS)[1.0] 39.1462 5.153 7.597 0.000 29.023 49.270
CRIM -0.1080 0.033 -3.287 0.001 -0.173 -0.043
ZN 0.0464 0.014 3.382 0.001 0.019 0.073
INDUS 0.0206 0.061 0.334 0.738 -0.100 0.141
NOX -17.7666 3.820 -4.651 0.000 -25.272 -10.262
RM 3.8099 0.418 9.116 0.000 2.989 4.631
AGE 0.0007 0.013 0.052 0.958 -0.025 0.027
DIS -1.4756 0.199 -7.398 0.000 -1.867 -1.084
RAD 0.3060 0.066 4.613 0.000 0.176 0.436
TAX -0.0123 0.004 -3.280 0.001 -0.020 -0.005
PTRATIO -0.9527 0.131 -7.283 0.000 -1.210 -0.696
B 0.0093 0.003 3.467 0.001 0.004 0.015
LSTAT -0.5248 0.051 -10.347 0.000 -0.624 -0.425
==============================================================================
Omnibus: 178.041 Durbin-Watson: 1.078
Prob(Omnibus): 0.000 Jarque-Bera (JB): 783.126
Skew: 1.521 Prob(JB): 8.84e-171
Kurtosis: 8.281 Cond. No. 2.01e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.01e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
-
풀랭크 방식을 사용했을 때는 각각의 절편값으로 출력되는 것을 확인할 수 있다.
-
절편값 자체는 축소랭크 방식과 똑같다.
4.4.3 범주형 독립변수와 실수 독립변수의 상호작용
절편이 같고 기울기가 다른 모형
model3 = sm.OLS.from_formula("MEDV ~ CRIM + C(CHAS):CRIM", data=df_boston)
result3 = model3.fit()
print(result3.summary())
OLS Regression Results
==============================================================================
Dep. Variable: MEDV R-squared: 0.174
Model: OLS Adj. R-squared: 0.171
Method: Least Squares F-statistic: 53.15
Date: Fri, 11 Jun 2021 Prob (F-statistic): 1.15e-21
Time: 19:19:43 Log-Likelihood: -1791.7
No. Observations: 506 AIC: 3589.
Df Residuals: 503 BIC: 3602.
Df Model: 2
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 23.8231 0.408 58.452 0.000 23.022 24.624
CRIM -0.4198 0.043 -9.688 0.000 -0.505 -0.335
C(CHAS)[T.1.0]:CRIM 1.7699 0.466 3.799 0.000 0.854 2.685
==============================================================================
Omnibus: 121.360 Durbin-Watson: 0.751
Prob(Omnibus): 0.000 Jarque-Bera (JB): 231.021
Skew: 1.351 Prob(JB): 6.83e-51
Kurtosis: 4.912 Cond. No. 12.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
-
CHAS가 0일 때 CRIM의 기울기는 -0.4198이다.
-
CHAS가 1일 때 CRIM의 기울기는 1.7699 - 0.4198 = 1.3501이다.
-
절편은 23.8231로 CHAS와 상관없이 동일하다.
절편과 기울기 모두 다른 모형
model4 = sm.OLS.from_formula("MEDV ~ C(CHAS) + CRIM + C(CHAS):CRIM", data=df_boston)
result4 = model4.fit()
print(result4.summary())
OLS Regression Results
==============================================================================
Dep. Variable: MEDV R-squared: 0.181
Model: OLS Adj. R-squared: 0.176
Method: Least Squares F-statistic: 36.87
Date: Fri, 11 Jun 2021 Prob (F-statistic): 1.52e-21
Time: 19:19:43 Log-Likelihood: -1789.9
No. Observations: 506 AIC: 3588.
Df Residuals: 502 BIC: 3605.
Df Model: 3
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 23.6377 0.418 56.595 0.000 22.817 24.458
C(CHAS)[T.1.0] 3.5036 1.816 1.930 0.054 -0.064 7.071
CRIM -0.4123 0.043 -9.502 0.000 -0.498 -0.327
C(CHAS)[T.1.0]:CRIM 1.1137 0.576 1.934 0.054 -0.018 2.245
==============================================================================
Omnibus: 120.101 Durbin-Watson: 0.767
Prob(Omnibus): 0.000 Jarque-Bera (JB): 226.558
Skew: 1.343 Prob(JB): 6.36e-50
Kurtosis: 4.879 Cond. No. 46.5
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
-
CHAS가 0일 때 절편은 23.6377, CRIM의 기울기는 -0.4123이다.
-
CHAS가 1일 때 절편은 23.6377 + 3.5036 = 27.1413, CRIM의 기울기는 1.1137 - 0.4123 = 0.7014이다.
Leave a comment