[Python] Data Science School - 4.4 Categorical Independent Variables

These are my study notes based on the Data Science School materials.

Some content has been omitted, added, or modified as needed while working through the exercises.


Basic setup

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

mpl.rc('font', family='NanumGothic') # font settings
mpl.rc('axes', unicode_minus=False) # draw the minus sign as an ASCII hyphen so it renders with this font

# chart style settings
sns.set(font="NanumGothic", rc={"axes.unicode_minus":False}, style='darkgrid')
plt.rc("figure", figsize=(10,8))

warnings.filterwarnings("ignore")

4.4 Categorical Independent Variables

4.4.1 Reduced-Rank

# Note: load_boston was removed in scikit-learn 1.2, so this import requires scikit-learn < 1.2.
from sklearn.datasets import load_boston
import statsmodels.api as sm

boston = load_boston()

dfX = pd.DataFrame(boston.data, columns=boston.feature_names)
dfy = pd.DataFrame(boston.target, columns=["MEDV"])
df_boston = pd.concat([dfX, dfy], axis=1)

model1 = sm.OLS.from_formula("MEDV ~ " + "+".join(boston.feature_names), data=df_boston)
result1 = model1.fit()
print(result1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Fri, 11 Jun 2021   Prob (F-statistic):          6.72e-135
Time:                        19:19:43   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     36.4595      5.103      7.144      0.000      26.432      46.487
CRIM          -0.1080      0.033     -3.287      0.001      -0.173      -0.043
ZN             0.0464      0.014      3.382      0.001       0.019       0.073
INDUS          0.0206      0.061      0.334      0.738      -0.100       0.141
CHAS           2.6867      0.862      3.118      0.002       0.994       4.380
NOX          -17.7666      3.820     -4.651      0.000     -25.272     -10.262
RM             3.8099      0.418      9.116      0.000       2.989       4.631
AGE            0.0007      0.013      0.052      0.958      -0.025       0.027
DIS           -1.4756      0.199     -7.398      0.000      -1.867      -1.084
RAD            0.3060      0.066      4.613      0.000       0.176       0.436
TAX           -0.0123      0.004     -3.280      0.001      -0.020      -0.005
PTRATIO       -0.9527      0.131     -7.283      0.000      -1.210      -0.696
B              0.0093      0.003      3.467      0.001       0.004       0.015
LSTAT         -0.5248      0.051    -10.347      0.000      -0.624      -0.425
==============================================================================
Omnibus:                      178.041   Durbin-Watson:                   1.078
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              783.126
Skew:                           1.521   Prob(JB):                    8.84e-171
Kurtosis:                       8.281   Cond. No.                     1.51e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
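  • CHAS is a categorical variable taking the values 0 and 1. Because it enters the formula as a plain 0/1 column, its coefficient is estimated exactly as in reduced-rank coding, with CHAS = 0 as the reference level absorbed into the intercept.

  • When CHAS is 0 the intercept is 36.4595; when CHAS is 1 it is 36.4595 + 2.6867 = 39.1462.

  • The CHAS coefficient, 2.6867, is how much the intercept rises when CHAS is 1 relative to the reference level 0.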
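
The reading of the dummy coefficient as an intercept shift can be checked on a small example. The following is a minimal sketch on hypothetical synthetic data (not the Boston data), using plain NumPy least squares instead of the formula API so that the dummy column is visible; the names `d`, `x`, and `y` are illustrative assumptions.

```python
import numpy as np

# Hypothetical synthetic data; d stands in for a 0/1 variable such as CHAS.
n = 200
rng = np.random.default_rng(0)
d = np.arange(n) % 2                 # binary category, both levels present
x = rng.normal(size=n)               # real-valued regressor
y = 2.0 + 3.0 * d + 0.5 * x          # noise-free, so the fit is exact

# Reduced-rank design: constant column, one dummy for level 1, then x.
X = np.column_stack([np.ones(n), d, x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

intercept_level0 = coef[0]            # intercept of the reference level
intercept_level1 = coef[0] + coef[1]  # reference intercept plus dummy coefficient
print(intercept_level0, intercept_level1)
```

Since the data are generated without noise, the two printed intercepts should come out at approximately 2.0 and 5.0, mirroring how 36.4595 and 39.1462 are related in the summary above.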

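When there are two or more categorical variables

  • Use reduced-rank coding. With full-rank coding, the dummy columns of each variable sum to one, so the columns of the design matrix become linearly dependent.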
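
Why reduced-rank coding is preferred here can be seen from the rank of the design matrix. The sketch below is an illustration under assumed data: two hypothetical categorical variables `a` (2 levels) and `b` (3 levels), and a small helper `one_hot` that is not part of any library.

```python
import numpy as np

n = 60
a = np.arange(n) % 2   # hypothetical categorical variable with 2 levels
b = np.arange(n) % 3   # hypothetical categorical variable with 3 levels

def one_hot(codes, k):
    """Return an (n, k) matrix of 0/1 dummy columns for integer codes."""
    return np.eye(k)[codes]

A = one_hot(a, 2)
B = one_hot(b, 3)

# Full-rank coding for both variables: 5 dummy columns, no constant.
X_full = np.hstack([A, B])
# Reduced-rank coding: constant plus all but the first level of each variable.
X_reduced = np.hstack([np.ones((n, 1)), A[:, 1:], B[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # number of columns vs. rank
print(X_reduced.shape[1], rank_reduced)
```

The full-rank matrix has five columns but only rank four, because the columns of `A` and the columns of `B` each add up to a column of ones. The reduced-rank matrix has four columns and rank four, so its coefficients are uniquely determined.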

4.4.2 Full-Rank

feature_names = list(boston.feature_names)
feature_names.remove("CHAS")
feature_names.append("C(CHAS)")  # treat CHAS as a categorical variable
# "0 +" removes the intercept, so every CHAS level gets its own dummy column
model2 = sm.OLS.from_formula("MEDV ~ 0 + " + "+".join(feature_names), data=df_boston)
result2 = model2.fit()
print(result2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Fri, 11 Jun 2021   Prob (F-statistic):          6.72e-135
Time:                        19:19:43   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
C(CHAS)[0.0]    36.4595      5.103      7.144      0.000      26.432      46.487
C(CHAS)[1.0]    39.1462      5.153      7.597      0.000      29.023      49.270
CRIM            -0.1080      0.033     -3.287      0.001      -0.173      -0.043
ZN               0.0464      0.014      3.382      0.001       0.019       0.073
INDUS            0.0206      0.061      0.334      0.738      -0.100       0.141
NOX            -17.7666      3.820     -4.651      0.000     -25.272     -10.262
RM               3.8099      0.418      9.116      0.000       2.989       4.631
AGE              0.0007      0.013      0.052      0.958      -0.025       0.027
DIS             -1.4756      0.199     -7.398      0.000      -1.867      -1.084
RAD              0.3060      0.066      4.613      0.000       0.176       0.436
TAX             -0.0123      0.004     -3.280      0.001      -0.020      -0.005
PTRATIO         -0.9527      0.131     -7.283      0.000      -1.210      -0.696
B                0.0093      0.003      3.467      0.001       0.004       0.015
LSTAT           -0.5248      0.051    -10.347      0.000      -0.624      -0.425
==============================================================================
Omnibus:                      178.041   Durbin-Watson:                   1.078
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              783.126
Skew:                           1.521   Prob(JB):                    8.84e-171
Kurtosis:                       8.281   Cond. No.                     2.01e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.01e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
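  • With full-rank coding, each CHAS level is reported with its own intercept.

  • The intercept values themselves are the same as in the reduced-rank fit: 36.4595 for CHAS = 0 and 36.4595 + 2.6867 = 39.1462 for CHAS = 1.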
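
The agreement between the two codings is not specific to this dataset. The sketch below fits both designs to hypothetical noisy synthetic data with plain NumPy least squares (an assumption made for illustration; the post itself uses the statsmodels formula API) and compares the resulting intercepts.

```python
import numpy as np

# Hypothetical synthetic data with noise; d plays the role of CHAS.
rng = np.random.default_rng(1)
n = 300
d = np.arange(n) % 2
x = rng.normal(size=n)
y = 1.0 + 2.0 * d - 0.7 * x + rng.normal(scale=0.5, size=n)

# Reduced-rank: constant + dummy for level 1 + x.
X_reduced = np.column_stack([np.ones(n), d, x])
# Full-rank: one dummy per level, no constant, + x.
X_full = np.column_stack([d == 0, d == 1, x]).astype(float)

beta_reduced, *_ = np.linalg.lstsq(X_reduced, y, rcond=None)
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Level-specific intercepts implied by each coding.
print(beta_reduced[0], beta_reduced[0] + beta_reduced[1])
print(beta_full[0], beta_full[1])
```

Both designs span the same column space, so the two printed lines should agree up to floating-point error, and the slope on `x` is identical in both fits.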

4.4.3 Interaction Between a Categorical and a Real-Valued Independent Variable

Model with the same intercept and different slopes

model3 = sm.OLS.from_formula("MEDV ~ CRIM + C(CHAS):CRIM", data=df_boston)
result3 = model3.fit()
print(result3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.174
Model:                            OLS   Adj. R-squared:                  0.171
Method:                 Least Squares   F-statistic:                     53.15
Date:                Fri, 11 Jun 2021   Prob (F-statistic):           1.15e-21
Time:                        19:19:43   Log-Likelihood:                -1791.7
No. Observations:                 506   AIC:                             3589.
Df Residuals:                     503   BIC:                             3602.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              23.8231      0.408     58.452      0.000      23.022      24.624
CRIM                   -0.4198      0.043     -9.688      0.000      -0.505      -0.335
C(CHAS)[T.1.0]:CRIM     1.7699      0.466      3.799      0.000       0.854       2.685
==============================================================================
Omnibus:                      121.360   Durbin-Watson:                   0.751
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              231.021
Skew:                           1.351   Prob(JB):                     6.83e-51
Kurtosis:                       4.912   Cond. No.                         12.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
  • When CHAS is 0, the slope of CRIM is -0.4198.

  • When CHAS is 1, the slope of CRIM is 1.7699 - 0.4198 = 1.3501.

  • The intercept is 23.8231 regardless of CHAS.
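
The interaction term `C(CHAS):CRIM` corresponds to a column that equals the real-valued variable when the dummy is 1 and zero otherwise. The following minimal sketch builds that column by hand on hypothetical noise-free synthetic data; the variable names and true coefficients are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical synthetic data: shared intercept, slope depends on d.
rng = np.random.default_rng(2)
n = 200
d = np.arange(n) % 2
x = rng.normal(size=n)
y = 1.0 - 0.4 * x + 1.8 * d * x      # noise-free, so the fit is exact

# Design: constant, x, and the interaction column d * x (no separate d column).
X = np.column_stack([np.ones(n), x, d * x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_level0 = coef[1]               # slope when d = 0
slope_level1 = coef[1] + coef[2]     # base slope plus interaction coefficient
print(coef[0], slope_level0, slope_level1)
```

The slope for level 1 is the sum of the base slope and the interaction coefficient, which is the same arithmetic as 1.7699 - 0.4198 above.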

Model with different intercepts and different slopes

model4 = sm.OLS.from_formula("MEDV ~ C(CHAS) + CRIM + C(CHAS):CRIM", data=df_boston)
result4 = model4.fit()
print(result4.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.181
Model:                            OLS   Adj. R-squared:                  0.176
Method:                 Least Squares   F-statistic:                     36.87
Date:                Fri, 11 Jun 2021   Prob (F-statistic):           1.52e-21
Time:                        19:19:43   Log-Likelihood:                -1789.9
No. Observations:                 506   AIC:                             3588.
Df Residuals:                     502   BIC:                             3605.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              23.6377      0.418     56.595      0.000      22.817      24.458
C(CHAS)[T.1.0]          3.5036      1.816      1.930      0.054      -0.064       7.071
CRIM                   -0.4123      0.043     -9.502      0.000      -0.498      -0.327
C(CHAS)[T.1.0]:CRIM     1.1137      0.576      1.934      0.054      -0.018       2.245
==============================================================================
Omnibus:                      120.101   Durbin-Watson:                   0.767
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              226.558
Skew:                           1.343   Prob(JB):                     6.36e-50
Kurtosis:                       4.879   Cond. No.                         46.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
  • When CHAS is 0, the intercept is 23.6377 and the slope of CRIM is -0.4123.

  • When CHAS is 1, the intercept is 23.6377 + 3.5036 = 27.1413 and the slope of CRIM is 1.1137 - 0.4123 = 0.7014.
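
A model in which both the intercept and the slope depend on the category yields the same point estimates as fitting a separate straight line to each group. The sketch below checks this on hypothetical noisy synthetic data, using NumPy least squares for the joint model and `np.polyfit` for the per-group lines; all names and true coefficients are illustrative assumptions.

```python
import numpy as np

# Hypothetical synthetic data: intercept and slope both depend on d.
rng = np.random.default_rng(3)
n = 400
d = np.arange(n) % 2
x = rng.normal(size=n)
y = 2.0 + 1.5 * d - 0.4 * x + 1.1 * d * x + rng.normal(scale=0.3, size=n)

# Joint model: constant, dummy, x, and the interaction d * x.
X = np.column_stack([np.ones(n), d, x, d * x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Separate straight-line fits per group; polyfit returns [slope, intercept].
mask = d == 1
slope0, intercept0 = np.polyfit(x[~mask], y[~mask], 1)
slope1, intercept1 = np.polyfit(x[mask], y[mask], 1)

print(coef[0], coef[2])                       # level 0: intercept, slope
print(coef[0] + coef[1], coef[2] + coef[3])   # level 1: intercept, slope
print(intercept0, slope0, intercept1, slope1)
```

Only the point estimates coincide. Standard errors from the joint model differ from those of the separate fits because the joint model pools the residual variance across both groups.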
