Latent feature based Collaborative filtering
I would like to optimize the latent based collaborative filtering technique to predict user rating for a movie using data sets available from movielens.
Principal Author: Tanuj Kr Aasawat
Most of the background knowledge is available here.
Modern consumers are inundated with choices. Electronic retailers and content providers offer a huge selection of products, with unprecedented opportunities to meet a variety of special needs and tastes. Matching consumers with the most appropriate products is key to enhancing user satisfaction and loyalty. Therefore, more retailers have become interested in recommender systems, which analyze patterns of user interest in products to provide personalized recommendations that suit a user’s taste. Because good personalized recommendations can add another dimension to the user experience, e-commerce leaders like Amazon.com and Netflix have made recommender systems a salient part of their websites. 
Broadly speaking, recommender systems are based on one of two strategies: Content filtering and Collaborative filtering.
The content filtering approach creates a profile for each user or product to characterize its nature. For example, a movie profile could include attributes regarding its genre, the participating actors, its box office popularity, and so forth. User profiles might include demographic information or answers provided on a suitable questionnaire. The profiles allow programs to associate users with matching products. A known successful realization of content filtering is the Music Genome Project, which is used for the Internet radio service Pandora.com.
Collaborative filtering analyzes relationships between users and inter-dependencies among products to identify new user-item associations. A major appeal of collaborative filtering is that it is domain free, yet it can address data aspects that are often elusive and difficult to profile using content filtering. While generally more accurate than content-based techniques, collaborative filtering suffers from what is called the cold
start problem, due to its inability to address the system’s new products and users. In this aspect, content filtering is superior.
The two primary areas of collaborative filtering are the neighborhood methods and latent factor models. Neighborhood methods are centered on computing the relationships between items or, alternatively, between users. The item-oriented approach evaluates a user’s preference for an item based on ratings of “neighboring” items by the same user. A product’s neighbors are other products that tend to get similar ratings when rated by the same user. 
Latent factor models are an alternative approach that tries to explain the ratings by characterizing both items and users on, say, 20 to 100 factors inferred from the ratings patterns. This paper goes more in-depth into Latent factor models.
Some of the most successful realizations of latent factor models are based on matrix factorization. In its basic form, matrix factorization characterizes both items and users by vectors of factors inferred from item rating patterns. High correspondence between item and user factors leads to a recommendation. These methods have become popular in recent years by combining good scalability with predictive accuracy. Matrix factorization models map both users and items to a joint latent factor space of dimensionality , such that user-item interactions are modeled as inner products in that space. Accordingly, each item is associated with a vector and each user is associated with a vector . For a given item , the elements of measure the extent to which the item possesses those factors, positive or negative. For a given user , the elements of measure the extent of interest the user has in items that are high on the corresponding factors, again, positive or negative. Suppose is a dataset of tuples, where means user gave item a rating of . Let be the predicted rating of user on item . The aim is to optimize the sum-of-squares error. User and item offsets are also considered as some users might give higher ratings than other users, and some movies may have higher ratings than other movies.
So, predicted rating = where item has a parameter and user has a parameter , and is the average rating. and are used to minimize the sum of squares error. If there are users and items, there are parameters to tune. Finding the best parameters is an optimization problem that can be done with a method like gradient descent. Optimizing the and parameters can help get better estimates of the ratings. To make more accurate predictions, few underlying property of users and movies, like age of users, age-appropriateness of movies etc., are also considered. These are put into a hidden feature and tuned to fit data. The hidden feature has a value for item and a value for user . The product of these is used to predict the rating of that item for that user:
If and are both positive or both negative, the feature will increase the prediction. If one or is negative and the other is positive, the feature will decrease the prediction.
For such hidden features, we will have a value for every item and feature , and a value for every feature and user . For each user and item the contributions of the features are added. This gives the prediction:
To avoid overfitting, a regularization term is added to prevent the parameters from growing too much and overfitting the given data. Here an L2 regularizer for each of the parameters is added to
the optimization. The goal is to choose the parameters to minimize:
where is the regularization parameter.
The problem in the above equation is that it's ambiguous. Many researchers/papers includes the regularization factor with the summation which is actually not the correct way. The correct way is:
In this page, I'll demonstrate about accuracy of the correct way, specifically, I'll demonstrate how much accuracy is achieved for both the models after 100 iterations and for different values of regularization factor.
I used and modified the code, available at , provided by the instructor. The modified method(with comments) is as follows:
def learn(self, num_iter = 50, trace = 1): for i in range(num_iter): #learn carries out num_iter steps of gradient descent. self.iter += 1 total_error=0 sumsq_error=0 #sum of square error for user in range(self.max_user_index): #subtracting the regularization term for user offset; this should be done only once unlike many times in previous implementation self.user_offset[user] -= (self.step_size*self.regularization*self.user_offset[user]) for item in range(self.max_item_index): #subtracting the regularization term for item offset; this should be done only once unlike many times in previous implementation self.item_offset[item] -= (self.step_size*self.regularization*self.item_offset[item]) for user in range(self.max_user_index): for f in range(self.num_features): #subtracting the regularization term for user feature; this should be done only once unlike many times in previous implementation self.user_feature[user][f] -= (self.step_size*self.regularization * self.user_feature[user][f]) for item in range(self.max_item_index): for f in range(self.num_features): #subtracting the regularization term for item feature; this should be done only once unlike many times in previous implementation self.item_feature[item][f] -= (self.step_size*self.regularization * self.item_feature[item][f]) for (user,item,rating,timestamp) in self.ratings: #predicting the rating prediction = (self.ave_rating #predicted value + self.user_offset[user] + self.item_offset[item] + sum([self.user_feature[user][f]*self.item_feature[item][f] for f in range(self.num_features)])) error = prediction - rating total_error += abs(error) sumsq_error += error * error #sum square error self.user_offset[user] += (-self.step_size*error) #do gradient decent over self.item_offset[item] += (-self.step_size*error) #do gradient decent over for f in range(self.num_features): user_f = self.user_feature[user][f] item_f = self.item_feature[item][f] self.user_feature[user][f] += (-self.step_size*error*item_f) self.item_feature[item][f] += (-self.step_size*error*user_f) if trace>0: print("Ave sq error = ",sumsq_error/self.num_ratings)
I used the data set from Movielens (http://grouplens.org/datasets/movielens/). Movielens is a movie recommendation system that acquires movie ratings from users, and uses the rating to make predictions of other movies a user might like. The rating is from 1 to 5 stars, where 5 stars is better. The rating are a set of (user, item, rating, timestamp) tuples. Adjoining figure contains sample dataset from Movielens. Here each user is given a unique number, each item (movie) is given a unique number, and the timestamp. MovieLens dataset were collected over various periods of time and have four types of datasets: 100K, 1M, 10M, and 20M. To prove my hypothesis, I've used only the first dataset because to prove the hypothesis I don't need very large datasets.
Adjoining figure demonstrates the accuracy of my implementation w.r.t the older version, that many of the research papers have used. The sum-of-square error is the minimal for every value of regularization factor/learning rate.
I would like to extend it as a conference publication with more results as well as incorporating the technique in our rating prediction algorithm with different data-sets like from IMDB.
Page 187 of code.pdf in http://artint.info/code/python/aifca_python.zip
Matrix Factorization Techniques for Recommender Systems, Yahoo Research, AT&T Research