[Python] Machine Learning Complete Guide - 09. Recommender Systems [Content-Based]

Updated: June 24, 2021

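These are my study notes based on the textbook Python Machine Learning Complete Guide (파이썬 머신러닝 완벽가이드).

Where the hands-on work called for it, some content has been omitted, added, or modified.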

Basic setup

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import warnings

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

mpl.rc('font', family='NanumGothic') # font setting
mpl.rc('axes', unicode_minus=False) # render the minus sign correctly with this font

# chart style settings
sns.set(font="NanumGothic", rc={"axes.unicode_minus":False}, style='darkgrid')
plt.rc("figure", figsize=(10,8))

warnings.filterwarnings("ignore")

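1. Content-based filtering

Recommender systems fall broadly into two approaches: content-based filtering and collaborative filtering.

Collaborative filtering is further divided into nearest-neighbor collaborative filtering and latent factor collaborative filtering.

When a user likes a particular item, content-based filtering recommends other items whose content is similar to it.

For example, if a user gave a movie a high rating, it recommends other movies with similar genres, actors, directors, keywords, and so on.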
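The idea can be sketched on a toy catalog. This is only a minimal illustration: the movie titles and genre sets are made up, and Jaccard overlap is used as a simple stand-in for the cosine similarity that the rest of this post actually uses.

```python
def jaccard(a, b):
    """Share of attributes two items have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(liked_title, catalog, top_n=2):
    """Rank every other item by how much its content overlaps with the liked item."""
    target = catalog[liked_title]
    scores = {
        title: jaccard(target, features)
        for title, features in catalog.items()
        if title != liked_title
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical catalog: titles and genres are invented for illustration
catalog = {
    "Movie A": {"Crime", "Drama"},
    "Movie B": {"Crime", "Drama", "Thriller"},
    "Movie C": {"Animation", "Family"},
    "Movie D": {"Drama"},
}

recommendations = recommend("Movie A", catalog)
print(recommendations)
```

Movie B shares two of three genres with Movie A and Movie D shares one of two, so they rank ahead of Movie C, which shares none.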

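1.1 Data loading and processing

IMDB, a movie information site, provides information on a large number of movies.

Here we use Kaggle's TMDB 5000 dataset, which provides processed metadata for about 5,000 major movies among them.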

movies = pd.read_csv("tmdb_5000_movies.csv")
movies.head(1)

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id":...	en	Avatar	In the 22nd century, a paraplegic Marine is di...	150.437577	[{"name": "Ingenious Film Partners", "id": 289...	[{"iso_3166_1": "US", "name": "United States o...	2009-12-10	2787965087	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso...	Released	Enter the World of Pandora.	Avatar	7.2	11800

movies.shape

(4803, 20)

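The data has 4,803 rows and 20 columns.
From these, we extract only the main columns into a new DataFrame.

# create a DataFrame from the main columns
col_lst = ['id', 'title', 'genres', 'vote_average', 'vote_count', 'popularity', 'keywords', 'overview']
movies_df = movies[col_lst].copy()  # copy so that later column assignments do not act on a slice of movies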

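id: movie ID
title: movie title
genres: the genres the movie belongs to (possibly several)
vote_average: average rating
vote_count: number of votes
popularity: how popular the movie is
keywords: key phrases that describe the movie
overview: summary description of the movie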

# widen the column display for output
pd.set_option('display.max_colwidth', 80)
movies_df[['genres','keywords']][:1]

	genres	keywords
0	[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "...	[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id...

Columns such as genres and keywords are stored as a list containing several dictionaries.
Each genre or keyword name can be extracted through the dictionary key name.

# reset the option
pd.reset_option("display.max_colwidth")

movies_df['genres'][0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

However, the value only looks like a list of dictionaries; the whole thing is actually stored as a single string.

from ast import literal_eval

movies_df['genres'].apply(literal_eval)[0]

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

ast's literal_eval() evaluates a string as the Python literal it spells out.
So here it turns the string into an actual list of dictionaries.
Let's use this to extract only the information we need.
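As a small self-contained check of that behavior, the sketch below parses a shortened copy of the genres string shown above. The comparison with json.loads() and the rejected expression are my own additions for illustration, not part of the textbook.

```python
import json
from ast import literal_eval

# Shortened copy of the genres value for the first movie
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)              # str -> list of dict
names = [d["name"] for d in parsed]     # keep only the genre names
print(type(raw).__name__, "->", type(parsed).__name__, names)

# This particular string is also valid JSON, so json.loads() gives the same object
same_as_json = json.loads(raw) == parsed

# literal_eval() accepts only literals; an expression such as a function call is rejected
try:
    literal_eval("len([1, 2, 3])")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
```

Unlike eval(), literal_eval() does not execute arbitrary code, which makes it a reasonable choice for strings read from a CSV file.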

from ast import literal_eval

# convert the strings to objects: a list of dictionaries
movies_df['genres'] = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)

# extract only name from each dictionary
movies_df['genres'] = movies_df['genres'].apply(lambda x : [ dic['name'] for dic in x] )
movies_df['keywords'] = movies_df['keywords'].apply(lambda x : [ dic['name'] for dic in x] )

movies_df[['genres', 'keywords']][:1]

	genres	keywords
0	[Action, Adventure, Fantasy, Science Fiction]	[culture clash, future, space war, space colon...

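After converting the strings to objects, we extracted name from each dictionary and stored the result as a list object.
These are now real list objects, not strings that look like lists.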

1.2 Measuring genre content similarity

Next, let's measure similarity by genre.

We measure similarity with cosine similarity, the same measure used for document similarity.

from sklearn.feature_extraction.text import CountVectorizer

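# convert the list objects to strings: separated by spaces
movies_df['genres_literal'] = movies_df['genres'].apply(lambda x : ' '.join(x))

# CountVectorizer
# min_df=1 keeps every term, like the original min_df=0; recent scikit-learn versions reject the integer 0
count_vect = CountVectorizer(min_df=1, ngram_range=(1,2))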
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])

print(genre_mat.shape)

(4803, 276)

First we converted the list objects to strings, then applied count-based feature vectorization.
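To see what kind of features ngram_range=(1,2) produces, here is a rough plain-Python approximation for a single movie. The helper unigrams_and_bigrams() is hypothetical and written only for this illustration; it mimics lowercasing and the default word pattern of CountVectorizer but is not the library's implementation.

```python
import re


def unigrams_and_bigrams(text):
    """Approximate the word 1-grams and 2-grams extracted from one document."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())   # words of two or more characters
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


# Genres of the first movie, joined with spaces as in genres_literal
genres = ["Action", "Adventure", "Fantasy", "Science Fiction"]
features = unigrams_and_bigrams(" ".join(genres))
print(features)
```

Because the genres are joined with spaces, "Science Fiction" is split into two words, and a bigram such as "fantasy science" that spans two different genres also becomes a feature. This is one reason the vocabulary is much larger than the number of distinct genres.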

from sklearn.metrics.pairwise import cosine_similarity

genre_sim = cosine_similarity(genre_mat, genre_mat)
genre_sim[0]

array([1.        , 0.59628479, 0.4472136 , ..., 0.        , 0.        ,
       0.        ])

We built the genre similarity matrix with cosine_similarity().
The output shows only the genre similarity between the first movie and every other movie.
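The sketch below works through one cosine similarity by hand, using the same unigram and bigram counting idea. The first genre list is the first movie's genres shown earlier; the second list is an assumption chosen for illustration, and ngram_counts() and cosine() are hypothetical helpers rather than scikit-learn functions.

```python
import math
import re
from collections import Counter


def ngram_counts(genres):
    """Count word 1-grams and 2-grams of the space-joined genre names."""
    tokens = re.findall(r"\b\w\w+\b", " ".join(genres).lower())
    return Counter(tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])])


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


first_movie = ngram_counts(["Action", "Adventure", "Fantasy", "Science Fiction"])
other_movie = ngram_counts(["Adventure", "Fantasy", "Action"])  # assumed genres, for illustration

sim = cosine(first_movie, other_movie)
print(round(sim, 8))  # 0.59628479
```

The first vector has 9 features and the second has 5, and they share 4 of them (action, adventure, fantasy, and the bigram "adventure fantasy"), so the similarity is 4 / (3 * sqrt(5)). Note that the order of the genres matters here, because it changes which bigrams are produced.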

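def find_sim_movie(df, sim_matrix, title_name, top_n=10):
    
    # index of the input movie (the first match if the title is duplicated)
    # assumes df has the default RangeIndex, so labels match the rows of sim_matrix
    title_movie = df[df['title'] == title_name]
    title_index = title_movie.index.values[0]
    
    # add the input movie's similarity to the DataFrame (modifies df in place)
    df["similarity"] = sim_matrix[title_index, :]
    
    # sort by similarity in descending order, then take the top indices
    temp = df.sort_values(by="similarity", ascending=False)
    final_index = temp.index.values[ : top_n]
    
    # final_index holds index labels, so select with loc rather than iloc
    return df.loc[final_index]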

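This function returns information on the movies whose genre similarity to a given movie (a given row index of movies_df) is highest.
I wrote the function differently from the textbook, adding the given movie's similarity to the DataFrame as a column.

# The 10 movies with the highest genre similarity to The Godfather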
similar_movies = find_sim_movie(movies_df, genre_sim, 'The Godfather', 10)
similar_movies[['title', 'vote_average', "similarity"]]

	title	vote_average	similarity
3636	Light Sleeper	5.7	1.0
892	Casino	7.8	1.0
3866	City of God	8.1	1.0
1243	Mean Streets	7.2	1.0
1370	21	6.5	1.0
4041	This Is England	7.4	1.0
1847	GoodFellas	8.2	1.0
2582	The Place Beyond the Pines	6.8	1.0
1946	The Bad Lieutenant: Port of Call - New Orleans	6.0	1.0
4217	Kids	6.8	1.0

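My results differ a lot from the textbook's, even though I tried many different approaches.
The textbook does not remove the movie's similarity to itself (it removes it later, in a new function).
If you run the function with Avatar, the movie at the first index, the first recommendation is Avatar itself.
If you remove only the movie's own index and then sort, the order changes yet again.
Many movies share exactly the same similarity value, so removing a particular row seems to change the sort order (sort_values() uses quicksort by default, which does not guarantee the order of tied rows).
So in the end I kept it simple: add the given movie's similarity and sort, with the movie itself included.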
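One possible way to make the order reproducible when many scores are tied is to sort with an explicit tie-breaker. The sketch below uses plain Python with made-up titles and scores; rank_by_similarity() is a hypothetical helper, and breaking ties by title is just one arbitrary but deterministic choice.

```python
def rank_by_similarity(records, exclude_title=None):
    """Sort by similarity (descending), breaking ties by title so the order is reproducible."""
    candidates = [r for r in records if r["title"] != exclude_title]
    ranked = sorted(candidates, key=lambda r: (-r["similarity"], r["title"]))
    return [r["title"] for r in ranked]


# Hypothetical similarity scores for illustration (not taken from the dataset)
records = [
    {"title": "Query Movie", "similarity": 1.0},
    {"title": "Movie C", "similarity": 1.0},
    {"title": "Movie A", "similarity": 1.0},
    {"title": "Movie B", "similarity": 0.5},
]

with_self = rank_by_similarity(records)
without_self = rank_by_similarity(records, exclude_title="Query Movie")
print(with_self)
print(without_self)
```

With a full sort key, removing the query movie no longer changes the relative order of the remaining movies. In pandas, a similar effect can be had by sorting on more than one column, or by passing kind="stable" to sort_values() when sorting on a single column.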

movies_df[['title','vote_average','vote_count']].sort_values('vote_average', ascending=False)[:10]

	title	vote_average	vote_count
3519	Stiff Upper Lips	10.0	1
4247	Me You and Five Bucks	10.0	2
4045	Dancer, Texas Pop. 81	10.0	1
4662	Little Big Top	10.0	1
3992	Sardaarji	9.5	2
2386	One Man's Hero	9.3	2
2970	There Goes My Baby	8.5	2
1881	The Shawshank Redemption	8.5	8205
2796	The Prisoner of Zenda	8.4	11
3337	The Godfather	8.4	5893

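This time the movies are sorted by rating.
Some movies that are not well known have a high rating only because they received very few votes.
Let's create a weighted rating function that takes both the vote count and the rating into account.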

percentile = 0.6
m = movies_df['vote_count'].quantile(percentile)
C = movies_df['vote_average'].mean()

def weighted_vote_average(record):
    v = record['vote_count']
    R = record['vote_average']
    
    return ( (v/(v+m)) * R ) + ( (m/(m+v)) * C )   

movies_df['weighted_vote'] = movies_df.apply(weighted_vote_average, axis=1)

The weighted rating follows the method used by IMDB.
It is a metric used by IMDB and does not seem to be a formally standardized one, so I do not write out the formula.
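Reading it off weighted_vote_average() above, the score is (v / (v + m)) * R + (m / (v + m)) * C, where v is the movie's vote count, R its average rating, m the vote-count threshold, and C the mean rating over all movies. The sketch below shows the effect on two movies. The values of m and C here are assumptions picked for illustration, not the values computed from the dataset.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own rating R with the global mean C, weighted by its vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical values for illustration only (not computed from the dataset)
m = 370.0   # assumed vote-count threshold
C = 6.0     # assumed mean rating over all movies

few_votes = weighted_rating(v=1, R=10.0, m=m, C=C)       # perfect rating, but a single vote
many_votes = weighted_rating(v=8205, R=8.5, m=m, C=C)    # lower rating, thousands of votes

print(round(few_votes, 3), round(many_votes, 3))
```

A movie with one vote is pulled almost all the way to the global mean C, while a movie with thousands of votes keeps a score close to its own rating. This is why the 10.0-rated movies with one or two votes drop out of the top of the list.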

temp = movies_df[['title','vote_average','vote_count','weighted_vote']]
temp.sort_values('weighted_vote', ascending=False)[:10]

	title	vote_average	vote_count	weighted_vote
1881	The Shawshank Redemption	8.5	8205	8.396052
3337	The Godfather	8.4	5893	8.263591
662	Fight Club	8.3	9413	8.216455
3232	Pulp Fiction	8.3	8428	8.207102
65	The Dark Knight	8.2	12002	8.136930
1818	Schindler's List	8.3	4329	8.126069
3865	Whiplash	8.3	4254	8.123248
809	Forrest Gump	8.2	7927	8.105954
2294	Spirited Away	8.3	3840	8.105867
2731	The Godfather: Part II	8.3	3338	8.079586

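These are the top 10 movies by weighted rating.
Tastes differ, but classics such as The Shawshank Redemption and The Godfather appear.
Now let's build a recommendation function that considers both genre similarity and weighted rating.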

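def find_sim_movie(df, sim_matrix, title_name, top_n=10):
    
    # index of the input movie (the first match if the title is duplicated)
    # assumes df has the default RangeIndex, so labels match the rows of sim_matrix
    title_movie = df[df['title'] == title_name]
    title_index = title_movie.index.values[0]
    
    # add the input movie's similarity to the DataFrame (modifies df in place)
    df["similarity"] = sim_matrix[title_index, :]
        
    # take the top indices by similarity, then by weighted rating (the movie itself removed)
    temp = df.sort_values(by=["similarity", "weighted_vote"], ascending=False)
    temp = temp[temp.index.values != title_index]
    
    final_index = temp.index.values[:top_n]
    
    # final_index holds index labels, so select with loc rather than iloc
    return df.loc[final_index]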

similar_movies = find_sim_movie(movies_df, genre_sim, 'The Godfather', 10)
similar_movies[['title', 'vote_average', "weighted_vote", "similarity"]]

	title	vote_average	weighted_vote	similarity
1881	The Shawshank Redemption	8.5	8.396052	1.0
2731	The Godfather: Part II	8.3	8.079586	1.0
1847	GoodFellas	8.2	7.976937	1.0
3866	City of God	8.1	7.759693	1.0
1663	Once Upon a Time in America	8.2	7.657811	1.0
3887	Trainspotting	7.8	7.591009	1.0
883	Catch Me If You Can	7.7	7.557097	1.0
892	Casino	7.8	7.423040	1.0
281	American Gangster	7.4	7.141396	1.0
4041	This Is England	7.4	6.739664	1.0

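Unlike the textbook, I made the function pick movies that have high similarity and, among those, a high weighted rating.
Even so, because the weighted rating is taken into account this time, the results are similar to the textbook's.
As mentioned earlier, so many movies have a similarity of 1 that the earlier difference comes down to a simple ordering issue.
The Godfather: Part II, Once Upon a Time in America, and others appear among the recommended movies.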


Romg2
