协同过滤概述

一直想写一篇系统介绍协同过滤的文章,最近在调查热评算法,又遇到了协同过滤相关的文章,决定系统的对协同过滤做调查研究。

定义

基于协同过滤的推荐算法,旨在充分利用集体智慧,即在大量人群的行为和数据中收集答案,以达到对整个人群得到统计意义上的结论。

特点

推荐的个性化程度高

出发点

  1. 兴趣相近的用户可能会对同样的东西感兴趣
  2. 用户可能比较偏爱与其已购买的东西相类似的商品

通常依据出发点分为两种:

  • 依据用户进行的协同过滤
  • 依据物品进行的协同过滤

冷启动问题

  • 新用户
  • 新物品
    • 通用解决方案,推荐榜单内容

计算的步骤

  1. 选择规模可控的维度
  2. 依据该维度计算相似度

鉴于现在的文章,理论介绍的多,实践介绍的少,因此这里就先不详述理论部分,直接针对代码进行介绍,在解释代码的过程中,展开对理论的描述,这样可能更生动一些。

代码实现 1: 基于电影数据集

本次实验的数据集来自明尼苏达大学的 Grouplens 研究小组整理的 MovieLens 。

该数据集包含20万用户对2万部电影的评级信息。 MovieLens 数据集地址

依赖库引入

import pandas as pd 
import numpy as np
import warnings
warnings.filterwarnings('ignore')

引入数据并做数据预处理

使用 read_csv() 加载

df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id','rating','titmestamp'])

df.head()

读取电影的标题

movie_titles = pd.read_csv('Movie_Titles')
movie_titles.head()

将电影标题和其它信息合并

df = pd.merge(df, movie_titles, on='item_id')
df.head()

至此,可以看到已经拥有的数据格式如下:

  user_id item_id rating timestamp title
作用 用户ID 电影ID 评分 时间 电影名称

对各列信息做统计

df.describe()

对每部的评价进行统计

ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings.head()

再将评价个数作为新属性添加

ratings['number_of_ratings'] = df.groupby('title')['rating'].count()
ratings.head()

对 rating 、 number_of_ratings 属性的统计结果进行可视化展示

import matplotlib.pyplot as plt
%matplotlib inline
ratings['rating'].hist(bins=50)

ratings['number_of_ratings'].hist(bins=60)

import seaborn as sns
sns.jointplot(x='rating', y='number_of_ratings', data=ratings)

对现有的矩阵表进行锚点确定。

movie_matrix = df.pivot_table(index='user_id', columns='title', values='rating')
movie_matrix.head()

按照评分排序

ratings.sort_values('number_of_ratings', ascending=False).head(10)

对部分属性进行提取

AFO_user_rating = movie_matrix['Air Force One (1997)']
contact_user_rating = movie_matrix['Contact (1997)']

AFO_user_rating.head()
contact_user_rating.head()
similar_to_air_force_one=movie_matrix.corrwith(AFO_user_rating)
similar_to_air_force_one.head()
similar_to_contact = movie_matrix.corrwith(contact_user_rating)
similar_to_contact.head()
corr_contact = pd.DataFrame(similar_to_contact, columns=['Correlation'])
corr_contact.dropna(inplace=True)
corr_contact.head()
corr_AFO = pd.DataFrame(similar_to_air_force_one, columns=['correlation'])
corr_AFO.dropna(inplace=True)
corr_AFO.head()
corr_AFO = corr_AFO.join(ratings['number_of_ratings'])
corr_contact = corr_contact.join(ratings['number_of_ratings'])
corr_AFO .head()
corr_contact.head()
corr_AFO[corr_AFO['number_of_ratings'] > 100].sort_values(by='correlation', ascending=False).head(10)
corr_contact[corr_contact['number_of_ratings'] > 100].sort_values(by='Correlation', ascending=False).head(10)

代码实现 2:基于简易模拟矩阵

本实验的数据集是一个(6, 6)的模拟矩阵

M = np.asarray([[3,7,4,9,9,7],
                [7,0,5,3,8,8],
               [7,5,5,0,8,4],
               [5,6,8,5,9,8],
               [5,8,8,8,10,9],
               [7,7,0,4,7,8]])

M=pd.DataFrame(M)               

行代表:实例 列代表:涉及到的属性

依赖库引入

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import correlation, cosine
import ipywidgets as widgets
from IPython.display import display, clear_output
from sklearn.metrics import pairwise_distances
from sklearn.metrics import mean_squared_error
from math import sqrt
import sys, os
from contextlib import contextmanager

findksimilarusers

初始化两个 list 数组

    similarities = []
    indices = []

调用 sklearn 中的 NearestNeighbors 方法计算 M(这里 M 即是 ratings) 的 k 个最临近的单元

    model_knn = NearestNeighbors(metric=metric, algorithm='brute')
    model_knn.fit(ratings)

选择 M 的第1行,并将该行变为二维矩阵,表达一个实例的全部属性

ratings.iloc[user_id - 1, :].values.reshape(1, -1)

输出

array([[3, 7, 4, 9, 9, 7]])

NearestNeighbors 模型计算 0 项与其它 ID 的距离,

    distances, indices = model_knn.kneighbors(ratings.iloc[user_id - 1, :].values.reshape(1, -1), n_neighbors=k)

并对结果排序,且给出对应的索引值。

1 - distances.flatten(): array([1.        , 0.97388994, 0.93462168, 0.88460046, 0.79926798,
       0.77922652])

indices: array([[0, 4, 3, 5, 1, 2]])

完整源码

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import correlation, cosine
import ipywidgets as widgets
from IPython.display import display, clear_output
from sklearn.metrics import pairwise_distances
from sklearn.metrics import mean_squared_error
from math import sqrt
import sys, os
from contextlib import contextmanager


#M is user-item ratings matrix where ratings are integers from 1-10
M = np.asarray([[3,7,4,9,9,7],
                [7,0,5,3,8,8],
               [7,5,5,0,8,4],
               [5,6,8,5,9,8],
               [5,8,8,8,10,9],
               [7,7,0,4,7,8]])
M=pd.DataFrame(M)

#declaring k,metric as global which can be changed by the user later
global k,metric
k=4
metric='cosine' #can be changed to 'correlation' for Pearson correlation similaries


# This function finds k similar users given the user_id and ratings matrix M
# Note that the similarities are same as obtained via using pairwise_distances
def findksimilarusers(user_id, ratings, metric=metric, k=k):
    """
    :param user_id: 目标用户
    :param ratings: 用户的评分
    :param metric: 指标
    :param k: 返回几个相似的用户
    :return: similarities 返回相似值(第一个是自己,顺序递减),indices对应用户下标
    """
    similarities = []
    indices = []
    # 在sklearn中,NearestNeighbors方法可用于基于各种相似性度量搜索k个最近邻
    model_knn = NearestNeighbors(metric=metric, algorithm='brute')
    model_knn.fit(ratings)

    distances, indices = model_knn.kneighbors(ratings.iloc[user_id - 1, :].values.reshape(1, -1), n_neighbors=k)
    similarities = 1 - distances.flatten()
    print '{0} most similar users for User {1}:\n'.format(k - 1, user_id)
    for i in range(0, len(indices.flatten())):
        if indices.flatten()[i] + 1 == user_id:
            continue
        else:
            print '{0}: User {1}, with similarity of {2}'.format(i,
                indices.flatten()[i] + 1, similarities.flatten()[i])

    return similarities, indices


# This function predicts rating for specified user-item combination based on user-based approach
def predict_userbased(user_id, item_id, ratings, metric=metric, k=k):
    prediction = 0
    similarities, indices = findksimilarusers(user_id, ratings, metric, k)  # similar users based on cosine similarity
    mean_rating = ratings.loc[user_id - 1, :].mean()  # to adjust for zero based indexing
    sum_wt = np.sum(similarities) - 1
    product = 1
    wtd_sum = 0

    for i in range(0, len(indices.flatten())):
        if indices.flatten()[i] + 1 == user_id:
            continue
        else:
            ratings_diff = ratings.iloc[indices.flatten()[i], item_id - 1] - np.mean(
                ratings.iloc[indices.flatten()[i], :])
            product = ratings_diff * (similarities[i])
            wtd_sum = wtd_sum + product

    prediction = int(round(mean_rating + (wtd_sum / sum_wt)))
    print '\nPredicted rating for user {0} -> item {1}: (prediction value){2}'.format(user_id, item_id, prediction)

    return prediction

#This function finds k similar items given the item_id and ratings matrix M
def findksimilaritems(item_id, ratings, metric=metric, k=k):
    similarities=[]
    indices=[]
    ratings=ratings.T
    model_knn = NearestNeighbors(metric = metric, algorithm = 'brute')
    model_knn.fit(ratings)

    distances, indices = model_knn.kneighbors(ratings.iloc[item_id-1, :].values.reshape(1, -1), n_neighbors = k+1)
    similarities = 1-distances.flatten()
    print '{0} most similar items for item {1}:\n'.format(k,item_id)
    for i in range(0, len(indices.flatten())):
        if indices.flatten()[i]+1 == item_id:
            continue
        else:
            print '{0}: Item {1} :, with similarity of {2}'.format(i,indices.flatten()[i]+1, similarities.flatten()[i])

    return similarities,indices


# This function predicts the rating for specified user-item combination based on item-based approach
def predict_itembased(user_id, item_id, ratings, metric=metric, k=k):
    prediction = wtd_sum = 0
    similarities, indices = findksimilaritems(item_id, ratings)  # similar users based on correlation coefficients
    sum_wt = np.sum(similarities) - 1
    product = 1

    for i in range(0, len(indices.flatten())):
        if indices.flatten()[i] + 1 == item_id:
            continue
        else:
            product = ratings.iloc[user_id - 1, indices.flatten()[i]] * (similarities[i])
            wtd_sum = wtd_sum + product
    prediction = int(round(wtd_sum / sum_wt))
    print '\nPredicted rating for user {0} -> item {1}: {2}'.format(user_id, item_id, prediction)

    return prediction


if __name__ == '__main__':
    similarities, indices = findksimilarusers(1, M, metric='cosine')
    # 'correlation' for Pearson correlation similarities
    similarities, indices = findksimilarusers(1, M, metric='correlation')
    predict_userbased(3, 4, M)

    similarities, indices = findksimilaritems(3, M)
    prediction = predict_itembased(1, 3, M)

输出结果

3 most similar users for User 1:
1: User 5, with similarity of 0.973889935402
2: User 4, with similarity of 0.934621684178
3: User 6, with similarity of 0.88460045723

3 most similar users for User 1:
1: User 5, with similarity of 0.761904761905
2: User 6, with similarity of 0.277350098113
3: User 4, with similarity of 0.208179450927

3 most similar users for User 3:
1: User 4, with similarity of 0.90951268934
2: User 2, with similarity of 0.874744414849
3: User 5, with similarity of 0.86545387815

Predicted rating for user 3 -> item 4: (prediction value)3

4 most similar items for item 3:
1: Item 5 :, with similarity of 0.918336125535
2: Item 6 :, with similarity of 0.874759773038
3: Item 1 :, with similarity of 0.810364746222
4: Item 4 :, with similarity of 0.796917800302

4 most similar items for item 3:
1: Item 5 :, with similarity of 0.918336125535
2: Item 6 :, with similarity of 0.874759773038
3: Item 1 :, with similarity of 0.810364746222
4: Item 4 :, with similarity of 0.796917800302

Predicted rating for user 1 -> item 3: 7

现有推荐系统参考

  • 基于热度的推荐系统: 最简单,但是也是最不个性化的。典型的案例就是 bilibili 的日/周/月榜
  • 基于内容的推荐系统: 基于内容描述的推荐系统
  • 基于协同过滤的推荐系统: 基于用户相似度的推荐系统 或者 基于物品相似度的推荐系统
PREVIOUS机器学习预测过程实践