Course:CPSC522/Collaborative Filtering

Title

Latent feature based Collaborative filtering

I would like to optimize the latent based collaborative filtering technique to predict user rating for a movie using data sets available from movielens.
Principal Author: Tanuj Kr Aasawat

Most of the background knowledge is available here.

Content

Introduction

Modern consumers are inundated with choices. Electronic retailers and content providers offer a huge selection of products, with unprecedented opportunities to meet a variety of special needs and tastes. Matching consumers with the most appropriate products is key to enhancing user satisfaction and loyalty. Therefore, more retailers have become interested in recommender systems, which analyze patterns of user interest in products to provide personalized recommendations that suit a user’s taste. Because good personalized recommendations can add another dimension to the user experience, e-commerce leaders like Amazon.com and Netflix have made recommender systems a salient part of their websites. [1]
Broadly speaking, recommender systems are based on one of two strategies: Content filtering and Collaborative filtering.

Content Filtering

The content filtering approach creates a profile for each user or product to characterize its nature. For example, a movie profile could include attributes regarding its genre, the participating actors, its box office popularity, and so forth. User profiles might include demographic information or answers provided on a suitable questionnaire. The profiles allow programs to associate users with matching products. A known successful realization of content filtering is the Music Genome Project, which is used for the Internet radio service Pandora.com.[1]

Collaborative Filtering

Collaborative filtering analyzes relationships between users and inter-dependencies among products to identify new user-item associations. A major appeal of collaborative filtering is that it is domain free, yet it can address data aspects that are often elusive and difficult to profile using content filtering. While generally more accurate than content-based techniques, collaborative filtering suffers from what is called the cold start problem, due to its inability to address the system’s new products and users. In this aspect, content filtering is superior.[1]
The two primary areas of collaborative filtering are the neighborhood methods and latent factor models. Neighborhood methods are centered on computing the relationships between items or, alternatively, between users. The item-oriented approach evaluates a user’s preference for an item based on ratings of “neighboring” items by the same user. A product’s neighbors are other products that tend to get similar ratings when rated by the same user. [1]
Latent factor models are an alternative approach that tries to explain the ratings by characterizing both items and users on, say, 20 to 100 factors inferred from the ratings patterns. This paper goes more in-depth into Latent factor models.

Methodology

Some of the most successful realizations of latent factor models are based on matrix factorization. In its basic form, matrix factorization characterizes both items and users by vectors of factors inferred from item rating patterns. High correspondence between item and user factors leads to a recommendation. These methods have become popular in recent years by combining good scalability with predictive accuracy. Matrix factorization models map both users and items to a joint latent factor space of dimensionality $f$ , such that user-item interactions are modeled as inner products in that space. Accordingly, each item $i$ is associated with a vector $q_{i}$ and each user $u$ is associated with a vector $p_{u}$ . For a given item $i$ , the elements of $q_{i}$ measure the extent to which the item possesses those factors, positive or negative. For a given user $u$ , the elements of $p_{u}$ measure the extent of interest the user has in items that are high on the corresponding factors, again, positive or negative. Suppose $D$ is a dataset of $<u,i,r>$ tuples, where $<u,i,r>$ means user $u$ gave item $i$ a rating of $r$ . Let $r_{ui}^{'}$ be the predicted rating of user $u$ on item $i$ . The aim is to optimize the sum-of-squares error. User and item offsets are also considered as some users might give higher ratings than other users, and some movies may have higher ratings than other movies.

So, predicted rating = $r_{ui}^{'}=r^{'}+b_{i}+c_{u}$ where item $i$ has a parameter $b_{i}$ and user $u$ has a parameter $c_{u}$ , and $r^{'}$ is the average rating. $b_{i}$ and $c_{u}$ are used to minimize the sum of squares error. If there are $n$ users and $m$ items, there are $n+m$ parameters to tune. Finding the best parameters is an optimization problem that can be done with a method like gradient descent. Optimizing the $b_{i}$ and $c_{u}$ parameters can help get better estimates of the ratings. To make more accurate predictions, few underlying property of users and movies, like age of users, age-appropriateness of movies etc., are also considered. These are put into a hidden feature and tuned to fit data. The hidden feature has a value $f_{i}$ for item $i$ and a value $g_{u}$ for user $u$ . The product of these is used to predict the rating of that item for that user:

$r_{ui}^{'}=r^{'}+b_{i}+c_{u}+f_{i}g_{u}$

If $f_{i}$ and $g_{u}$ are both positive or both negative, the feature will increase the prediction. If one $f_{i}$ or $g_{u}$ is negative and the other is positive, the feature will decrease the prediction. For $k$ such hidden features, we will have a value $f_{ik}$ for every item $i$ and feature $k$ , and a value $g_{ku}$ for every feature $k$ and user $u$ . For each user $u$ and item $i$ the contributions of the features are added. This gives the prediction:

$r_{ui}^{'}=r^{'}+b_{i}+c_{u}+\sum _{\overline {s}}f_{ik}g_{ku}$

To avoid overfitting, a regularization term is added to prevent the parameters from growing too much and overfitting the given data. Here an L2 regularizer for each of the parameters is added to the optimization. The goal is to choose the parameters to minimize:

$\sum _{<u,i,r>\in D}(r^{'}+b_{i}+c_{u}+(\sum _{\overline {s}}f_{ik}g_{ku})-r)^{2}+\lambda (b_{i}^{2}+c_{u}^{2}+\sum _{k}(f_{ik}^{2}+g_{ku}^{2}))$

where $\lambda$ is the regularization parameter.

The problem in the above equation is that it's ambiguous. Many researchers/papers includes the regularization factor with the summation which is actually not the correct way. The correct way is:

$(\sum _{<u,i,r>\in D}(r^{'}+b_{i}+c_{u}+(\sum _{\overline {s}}f_{ik}g_{ku})-r)^{2}))+(\lambda (b_{i}^{2}+c_{u}^{2}+\sum _{k}(f_{ik}^{2}+g_{ku}^{2})))$

In this page, I'll demonstrate about accuracy of the correct way, specifically, I'll demonstrate how much accuracy is achieved for both the models after 100 iterations and for different values of regularization factor.

Code

I used and modified the code, available at [1], provided by the instructor. The modified method(with comments) is as follows:

def learn(self, num_iter = 50, trace = 1):
   for i in range(num_iter): #learn carries out num_iter steps of gradient descent.
       self.iter += 1
       total_error=0
       sumsq_error=0 #sum of square error
       for user in range(self.max_user_index): #subtracting the regularization term for user offset; this should be done only once unlike many times in previous implementation
           self.user_offset[user] -= (self.step_size*self.regularization*self.user_offset[user])
       for item in range(self.max_item_index): #subtracting the regularization term for item offset; this should be done only once unlike many times in previous implementation
           self.item_offset[item] -= (self.step_size*self.regularization*self.item_offset[item])
       for user in range(self.max_user_index): 
           for f in range(self.num_features): #subtracting the regularization term for user feature; this should be done only once unlike many times in previous implementation 
               self.user_feature[user][f] -= (self.step_size*self.regularization * self.user_feature[user][f])
       for item in range(self.max_item_index):
           for f in range(self.num_features): #subtracting the regularization term for item feature; this should be done only once unlike many times in previous implementation
               self.item_feature[item][f] -= (self.step_size*self.regularization * self.item_feature[item][f])
       for (user,item,rating,timestamp) in self.ratings: #predicting the rating
           prediction = (self.ave_rating #predicted value
                         + self.user_offset[user]
                         + self.item_offset[item]
                         + sum([self.user_feature[user][f]*self.item_feature[item][f]
                                for f in range(self.num_features)]))
           error = prediction - rating
           total_error += abs(error)
           sumsq_error += error * error #sum square error
           self.user_offset[user] += (-self.step_size*error) #do gradient decent over  $b_{i}$ 
           self.item_offset[item] += (-self.step_size*error) #do gradient decent over  $c_{u}$ 
           for f in range(self.num_features):
               user_f = self.user_feature[user][f]
               item_f = self.item_feature[item][f]
               self.user_feature[user][f] += (-self.step_size*error*item_f)
               self.item_feature[item][f] += (-self.step_size*error*user_f)
       if trace>0:
           print("Ave sq error = ",sumsq_error/self.num_ratings)

Sample Movielens Dataset

Dataset

I used the data set from Movielens (http://grouplens.org/datasets/movielens/). Movielens is a movie recommendation system that acquires movie ratings from users, and uses the rating to make predictions of other movies a user might like. The rating is from 1 to 5 stars, where 5 stars is better. The rating are a set of (user, item, rating, timestamp) tuples. Adjoining figure contains sample dataset from Movielens. Here each user is given a unique number, each item (movie) is given a unique number, and the timestamp. MovieLens dataset were collected over various periods of time and have four types of datasets: 100K, 1M, 10M, and 20M. To prove my hypothesis, I've used only the first dataset because to prove the hypothesis I don't need very large datasets.

Evaluation: Accuracy gain over older version

Evaluation

Adjoining figure demonstrates the accuracy of my implementation w.r.t the older version, that many of the research papers have used. The sum-of-square error is the minimal for every value of regularization factor/learning rate.

Future Work

I would like to extend it as a conference publication with more results as well as incorporating the technique in our rating prediction algorithm with different data-sets like from IMDB.

Annotated Bibliography

[1]Page 187 of code.pdf in http://artint.info/code/python/aifca_python.zip
[2]Matrix Factorization Techniques for Recommender Systems, Yahoo Research, AT&T Research