[Python] 데이터 사이언스 스쿨 - 4.8 분산 분석과 모형 성능

Updated: June 11, 2021

데이터 사이언스 스쿨 자료를 토대로 공부한 내용입니다.

실습과정에서 필요에 따라 내용의 누락 및 추가, 수정사항이 있습니다.

기본 세팅

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import warnings

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

mpl.rc('font', family='NanumGothic') # 폰트 설정
mpl.rc('axes', unicode_minus=False) # 유니코드에서 음수 부호 설정

# 차트 스타일 설정
sns.set(font="NanumGothic", rc={"axes.unicode_minus":False}, style='darkgrid')
plt.rc("figure", figsize=(10,8))

warnings.filterwarnings("ignore")

4.8 분산 분석과 모형 성능

4.8.1 분산분석 테이블

변동의 요인	제곱합	제곱합(단순)	제곱합(행렬)	자유도	평균제곱합	F
회귀	ESS	$\sum (\hat{y}-\bar{y})^{2}$	$Y^{T}(H-\dfrac{1}{n}J)Y$	k - 1	$MSR = \dfrac{ESS}{k - 1}$	$F^{*} = \dfrac{MSR}{MSE}$
오차(잔차)	RSS	$\sum (y-\hat{y})^{2}$	$Y^{T}(I-H)Y$	n - k	$MSE = \dfrac{RSS}{n - k}$
합계	TSS	$\sum (y-\bar{y})^{2}$	$Y^{T}(I-\dfrac{1}{n}J)Y$	n - 1

$\text{TSS = ESS + RSS}$
$k$는 상수항을 포함한 독립변수의 갯수다.
평균제곱합의 표기법은 임시로 정해두었다.

from sklearn.datasets import load_boston
import statsmodels.api as sm

boston = load_boston()

# 데이터 불러오기
dfX0 = pd.DataFrame(boston.data, columns=boston.feature_names)
dfX = sm.add_constant(dfX0)
dfy = pd.DataFrame(boston.target, columns=["MEDV"])

# 회귀모형 적합/예측
model_boston = sm.OLS(dfy, dfX)
result_boston = model_boston.fit()
pred = result_boston.predict(dfX)

print(result_boston.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Fri, 11 Jun 2021   Prob (F-statistic):          6.72e-135
Time:                        23:48:57   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         36.4595      5.103      7.144      0.000      26.432      46.487
CRIM          -0.1080      0.033     -3.287      0.001      -0.173      -0.043
ZN             0.0464      0.014      3.382      0.001       0.019       0.073
INDUS          0.0206      0.061      0.334      0.738      -0.100       0.141
CHAS           2.6867      0.862      3.118      0.002       0.994       4.380
NOX          -17.7666      3.820     -4.651      0.000     -25.272     -10.262
RM             3.8099      0.418      9.116      0.000       2.989       4.631
AGE            0.0007      0.013      0.052      0.958      -0.025       0.027
DIS           -1.4756      0.199     -7.398      0.000      -1.867      -1.084
RAD            0.3060      0.066      4.613      0.000       0.176       0.436
TAX           -0.0123      0.004     -3.280      0.001      -0.020      -0.005
PTRATIO       -0.9527      0.131     -7.283      0.000      -1.210      -0.696
B              0.0093      0.003      3.467      0.001       0.004       0.015
LSTAT         -0.5248      0.051    -10.347      0.000      -0.624      -0.425
==============================================================================
Omnibus:                      178.041   Durbin-Watson:                   1.078
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              783.126
Skew:                           1.521   Prob(JB):                    8.84e-171
Kurtosis:                       8.281   Cond. No.                     1.51e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

print("TSS = ", result_boston.centered_tss)
print("ESS = ", result_boston.mse_model*(13))
print("RSS = ", result_boston.ssr)
print("ESS + RSS = ", result_boston.mse_model*(13) + result_boston.ssr)

TSS =  42716.29541501977
ESS =  31637.510837064794
RSS =  11078.784577954977
ESS + RSS =  42716.29541501977

회귀분석 결과의 여러 속성을 이용해서 TSS, RSS 등을 쉽게 알 수 있다.
뒤에 나올 결정계수 등도 회귀분석 결과에 대부분 속성이 부여되어있다.

4.8.2 결정계수

회귀모형에 의해 설명되는 종속변수의 변동 정도

$R^{2} = \dfrac{ESS}{TSS} = 1 - \dfrac{RSS}{TSS}$

print("R squared = ", result_boston.mse_model*(13) / result_boston.centered_tss)

R squared =  0.7406426641094095

$\hat{y}$, $y$의 상관계수 제곱은 결정계수

temp = pd.concat([pred,dfy], axis=1)
rho = temp.corr().iloc[0,1]

print("rho squared = ", round(rho**2,3))
print("R squared = ", result_boston.rsquared.round(3))

rho squared =  0.741
R squared =  0.741

조정 결정계수

결정계수의 단점은 독립변수를 계속 추가하면 실제 독립변수의 유의성여부와 상관없이 RSS가 감소하고 ESS가 증가하여 결정계수가 높아진다.

이를 보완하기위해 조정 결정계수를 사용한다.

$R_{adj}^2 = 1 - \dfrac{n-1}{n-K} \left(1-R^2 \right)$

print("adjusted R squared = ", result_boston.rsquared_adj.round(3))

adjusted R squared =  0.734

adjR = 1 - (506 - 1) / (506 - 14) * (1 - result_boston.rsquared)
print("adjusted R squared = ", adjR.round(3))

adjusted R squared =  0.734

4.8.3 절편이 없는 모형

절편이 없는 모형에선 실제 샘플평균과 상관없이 $\bar{y}=0$ 이라는 가정하에 $\text{TSS}$를 계산한다.

이렇게 정의하지 않으면 $\text{TSS = ESS + RSS}$ 관계식이 성립하지 않아서 결정계수의 값이 1보다 커지게 된다.

from sklearn.datasets import make_regression

X0, y, coef = make_regression(
    n_samples=100, n_features=1, noise=30, bias=100, coef=True, random_state=0)
dfX = pd.DataFrame(X0, columns=["X"])
dfy = pd.DataFrame(y, columns=["Y"])
df = pd.concat([dfX, dfy], axis=1)

model2 = sm.OLS.from_formula("Y ~ X + 0", data=df)
result2 = model2.fit()
print(result2.summary())

                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                      Y   R-squared (uncentered):                   0.188
Model:                            OLS   Adj. R-squared (uncentered):              0.179
Method:                 Least Squares   F-statistic:                              22.87
Date:                Fri, 11 Jun 2021   Prob (F-statistic):                    6.03e-06
Time:                        23:48:57   Log-Likelihood:                         -604.91
No. Observations:                 100   AIC:                                      1212.
Df Residuals:                      99   BIC:                                      1214.
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
X             48.8109     10.206      4.783      0.000      28.561      69.061
==============================================================================
Omnibus:                        3.876   Durbin-Watson:                   0.204
Prob(Omnibus):                  0.144   Jarque-Bera (JB):                2.209
Skew:                          -0.092   Prob(JB):                        0.331
Kurtosis:                       2.296   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

print("기존 평균 사용시 결정계수:", result2.mse_model*(1) / result2.centered_tss)
print("평균 0으로 가정시 결정계수:", result2.mse_model*(1) / result2.uncentered_tss)
print("회귀결과 결정계수:", result2.rsquared)

기존 평균 사용시 결정계수: 0.8336322247273396
평균 0으로 가정시 결정계수: 0.18768724705943896
회귀결과 결정계수: 0.18768724705943896

회귀 결과에서도 결정계수 계산에 대해 절편이 없음을 안내하고 있다.
따라서 모형의 결정계수를 비교할 때 절편이 없는 모형과 절편이 있는 모형은 직접 비교하면 안된다.
TSS에 관한 속성은 절편이 있으면 centered_tss, 절편이 없으면 uncentered_tss를 사용한다.

4.8.4 F-검정을 이용한 모형비교

from sklearn.datasets import load_boston

boston = load_boston()
dfX0_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
dfX_boston = sm.add_constant(dfX0_boston)
dfy_boston = pd.DataFrame(boston.target, columns=["MEDV"])

df_boston = pd.concat([dfX_boston, dfy_boston], axis=1)

# 전체 모형
model_full = sm.OLS.from_formula(
    "MEDV ~ CRIM + ZN + INDUS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT + CHAS", data=df_boston)

# 축소 모형 (INDUS, AGE 제외)
model_reduced = sm.OLS.from_formula(
    "MEDV ~ CRIM + ZN + NOX + RM + DIS + RAD + TAX + PTRATIO + B + LSTAT + CHAS", data=df_boston)

# 축소 모형, 전체 모형 순으로 기입
sm.stats.anova_lm(model_reduced.fit(), model_full.fit())

	df_resid	ssr	df_diff	ss_diff	F	Pr(>F)
0	494.0	11081.363952	0.0	NaN	NaN	NaN
1	492.0	11078.784578	2.0	2.579374	0.057274	0.944342

전체 모형과 일부 변수를 제외한 축소 모형을 이용해 F-검정을 실시하여 제외한 일부 변수에 유의성을 확인 할 수 있다.
이 경우 유의수준을 0.05로 본다면 귀무가설을 채택해서 INDUS와 AGE는 유의하지 않다고 할 수 있다.
statsmodels.api의 stats.anova_lm()를 사용한다.

4.8.5 F-검정을 사용한 변수 중요도 비교

model_boston = sm.OLS.from_formula(
    "MEDV ~ CRIM + ZN + INDUS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT + CHAS", data=df_boston)
result_boston = model_boston.fit()
sm.stats.anova_lm(result_boston, typ=2)

	sum_sq	df	F	PR(>F)
CRIM	243.219699	1.0	10.801193	1.086810e-03
ZN	257.492979	1.0	11.435058	7.781097e-04
INDUS	2.516668	1.0	0.111763	7.382881e-01
NOX	487.155674	1.0	21.634196	4.245644e-06
RM	1871.324082	1.0	83.104012	1.979441e-18
AGE	0.061834	1.0	0.002746	9.582293e-01
DIS	1232.412493	1.0	54.730457	6.013491e-13
RAD	479.153926	1.0	21.278844	5.070529e-06
TAX	242.257440	1.0	10.758460	1.111637e-03
PTRATIO	1194.233533	1.0	53.034960	1.308835e-12
B	270.634230	1.0	12.018651	5.728592e-04
LSTAT	2410.838689	1.0	107.063426	7.776912e-23
CHAS	218.970357	1.0	9.724299	1.925030e-03
Residual	11078.784578	492.0	NaN	NaN

typ를 2로 설정하면 변수를 하나씩 제외해서 F-검정을 한번에 실시한다.
이때 p-value는 전체 모형에서 단일 회귀계수의 t-검정의 p-value와 같다.

Share on

Twitter Facebook LinkedIn

Romg2

[Python] 데이터 사이언스 스쿨 - 4.8 분산 분석과 모형 성능

4.8 분산 분석과 모형 성능

4.8.1 분산분석 테이블

4.8.2 결정계수

4.8.3 절편이 없는 모형

4.8.4 F-검정을 이용한 모형비교

4.8.5 F-검정을 사용한 변수 중요도 비교

Share on

Leave a comment

You may also enjoy

[OPGG] 인턴 연계 과정 - 미니맵 챔피언 인식

[OPGG] 인턴 연계 과정 - 프로 리그 데이터 수집

[OPGG] 파이널 프로젝트 - 포지션 예측

[Python] 코딩 도장 - ASCII Art N