Matrix Factorization for Recommender Systems

Matrix Factorization (MF) is a highly effective approach used in modern recommender systems (such as Netflix's).

Here we concentrate on how the first paper paved the way for the approaches and algorithms used in the second paper.

First paper: Major components of the Gravity Recommendation System[1]

Second paper: Matrix Factorization Techniques for Recommender Systems[2]

Principal Author: Fardad Homafar

Collaborators:

Abstract

Recommender systems are an essential part of today's online marketing websites, and they remain an open-ended world of new theories and methods. On this page, we briefly explain some of the approaches used in modern recommender systems, with an emphasis on matrix factorization, and then see how those methods are implemented in practice. The second paper suggests various modifications that refine the traditional approach. In the last part, those approaches are applied to a dataset released by Netflix for a competition to find the best recommender system developers could think of. The results look fascinating.

Builds on

Recommender systems can be broadly categorized into two classes. The first is Content-based Filtering, in which explicit profiles are created for products and users to capture their attributes. The other class is Collaborative Filtering, which uses past user behavior and is becoming more and more common among recommender systems. Collaborative Filtering itself consists of two distinct categories: Neighborhood-based methods and Latent Factor methods. Matrix Factorization, which is covered in more detail here, is classified under Latent Factor methods.[3]

Related Pages

As stated before, the focus of this page is on matrix factorization; however, the papers give some details on Neighborhood-based methods as well. A more mathematically fundamental treatment of Matrix Factorization is also possible.[4] The dataset considered for the case study on this page comes from the Netflix Prize competition.[5]

Content

Introduction

Nowadays, customers have a long list of choices to purchase from, ranging from groceries and entertainment to electronic devices and residences. On the other side, retailers need to focus on finding the best ways of selling their products. For instance, consider Amazon with its broad range of products. It is reasonable to imagine that there are many potential customers who have bought some products and are capable of buying other, somehow related products too. Thus, retailers need ways to recommend items to the appropriate users with matching tastes. This is where recommender systems come into play. Using these systems, not only do retailers benefit from an increase in sales, but customer satisfaction goes up as well. The same story holds for Netflix: there are many movies in assorted genres that can appeal to viewers of other movies. Considering the large number of items and customers, the response time of recommender systems and how accurately they match users' preferences are of paramount importance.

Here we start with the Matrix Factorization method. It is important to note that in both papers the products are Netflix movies that are to be recommended to the appropriate users.

Matrix Definitions

The two papers use different notations for matrix factorization, but both eventually lead to the same factors. Both papers state that this way of factorizing matrices is closely related to the singular value decomposition (SVD) algorithm, and that it can be used to derive latent factors. We explain the method of the first paper.

X is the rating matrix with possible values of the integers from 1 to 5. Its dimension is $I \times J$; in other words, $I$ is the number of customers and $J$ the number of movies. The entries of X, denoted $x_{ij}$, give the rating of the jth movie provided by the ith customer.

Here, we are trying to approximate X by a multiplication of two matrices U and M:

$X \approx UM$

U and M have dimensions $I \times K$ and $K \times J$ respectively, where K is the number of latent features. The values $u_{ik}$ and $m_{kj}$ can be considered the kth feature of the ith user and the kth feature of the jth movie. Keep in mind that X contains only integers from 1 to 5, but U and M can contain real numbers. The training phase of our model is to find U and M; we then use the model to predict unseen data. The predicted entries of X are:

$\hat{x}_{ij} = \sum_{k=1}^{K} u_{ik} m_{kj}$
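For instance, with $K = 2$ hypothetical features, a user row $(u_{i1}, u_{i2}) = (1.2,\ 0.3)$ and a movie column $(m_{1j}, m_{2j}) = (2.0,\ 1.0)$ (illustrative numbers, not taken from the papers) give:

$\hat{x}_{ij} = 1.2 \cdot 2.0 + 0.3 \cdot 1.0 = 2.7$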

Calculating Factors

As the second paper states, factorization in collaborative filtering has some difficulties because of the missing values caused by sparseness in X. In other words, not all users rate all of the movies on Netflix. Moreover, if we consider only the available values of X and train our machine learning model on them, it will likely overfit that data. Earlier systems utilized imputation to fill in missing ratings and make the rating matrix dense,[6] but imputation can be problematic, for instance by being computationally expensive. This is where the first paper's approach comes in: model only the available ratings and prevent overfitting through regularization. For each known rating, we define the error:

$e_{ij} = x_{ij} - \hat{x}_{ij}$

Approach 1: Gradient Descent Method

To find the optimal factors, the first paper uses the gradient descent method and minimizes the squared error $e_{ij}^2$ over the known ratings.

By calculating the gradient we have:

$\frac{\partial e_{ij}^2}{\partial u_{ik}} = -2 e_{ij} m_{kj}, \qquad \frac{\partial e_{ij}^2}{\partial m_{kj}} = -2 e_{ij} u_{ik}$

According to gradient descent, we have to move in the opposite direction of the gradient to minimize our objective. Thus, we update the entries of U and M like this:

$u'_{ik} = u_{ik} + 2\eta\, e_{ij} m_{kj}, \qquad m'_{kj} = m_{kj} + 2\eta\, e_{ij} u_{ik}$

Here, η is the learning rate of the algorithm. As stated before, to reduce overfitting, we apply regularization with the parameter λ:

$u'_{ik} = u_{ik} + \eta \left( 2 e_{ij} m_{kj} - \lambda u_{ik} \right), \qquad m'_{kj} = m_{kj} + \eta \left( 2 e_{ij} u_{ik} - \lambda m_{kj} \right)$
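To make these update rules concrete, here is a minimal sketch of this regularized stochastic gradient descent in Python with NumPy. The toy rating matrix, the hyperparameter values, and the convention that 0 marks a missing rating are illustrative assumptions, not taken from either paper:

```python
import numpy as np

def factorize_sgd(X, K=2, eta=0.01, lam=0.02, epochs=500, seed=0):
    """Approximate X ~ U @ M with the regularized SGD updates above.
    Zero entries of X are treated as missing ratings (an assumption)."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    U = rng.normal(scale=0.1, size=(I, K))
    M = rng.normal(scale=0.1, size=(K, J))
    known = [(i, j) for i in range(I) for j in range(J) if X[i, j] > 0]
    for _ in range(epochs):
        for i, j in known:
            e = X[i, j] - U[i, :] @ M[:, j]   # e_ij = x_ij - x̂_ij
            u_old = U[i, :].copy()            # update both factors from old values
            U[i, :] += eta * (2 * e * M[:, j] - lam * U[i, :])
            M[:, j] += eta * (2 * e * u_old - lam * M[:, j])
    return U, M

# Toy example: 4 users x 3 movies, 0 marks a missing rating.
X = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5],
              [0, 1, 4]], dtype=float)
U, M = factorize_sgd(X)
print(np.round(U @ M, 2))   # predicted ratings, including the missing cells
```

Note that each pass loops over the known ratings only, which is exactly how the papers sidestep the missing entries.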

The second paper also uses the gradient descent method, building on the approach of the first paper, again in another notation.

Approach 2: Alternating Least Squares

This method is utilized exclusively in the second paper. It states that although the gradient descent method is faster and easier to implement, Alternating Least Squares (ALS) is a better option in two situations: when the system is capable of parallel computation, and when the system uses implicit data.

To begin this approach, we need to get familiar with the notation of the second paper too. This is the objective function of that paper:

$\min_{q^*, p^*} \sum_{(u,i) \in \kappa} \left( r_{ui} - q_i^T p_u \right)^2 + \lambda \left( \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)$

where $\kappa$ is the set of (u, i) pairs for which the rating $r_{ui}$ is known.

Here, $r_{ui}$ is the actual rating given by user u to item (movie) i, $q_i$ is a vector that stores the features of item i, and similarly $p_u$ is a vector that stores the features of user u. $\lambda$ is a parameter that implements regularization, as explained before.

Let's get back to ALS. To utilize this method, we fix each of the unknowns ($q_i$ and $p_u$) in turn. To be more specific, when we fix one of the unknowns, the problem is transformed into a quadratic optimization problem, which can be easily solved. Fixing all the $p_u$'s, the algorithm recomputes all the $q_i$'s by solving a least-squares problem, and vice versa. This loop continues until convergence.
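Below is a minimal sketch of one ALS sweep in this notation, assuming NumPy and a small dense rating matrix R in which zeros mark missing entries; the toy matrix, the dimensionality K = 2, and the regularization constant are illustrative, not from the papers:

```python
import numpy as np

def als_step(R, P, Q, lam):
    """One ALS sweep: solve for all user vectors p_u with item vectors fixed,
    then for all item vectors q_i with user vectors fixed.
    R: ratings (0 = missing), P: user factors, Q: item factors."""
    K = P.shape[1]
    for u in range(R.shape[0]):              # fix Q, solve for p_u
        rated = R[u, :] > 0
        if rated.any():
            Qu = Q[rated]                    # factors of items user u rated
            A = Qu.T @ Qu + lam * np.eye(K)  # regularized normal equations
            P[u] = np.linalg.solve(A, Qu.T @ R[u, rated])
    for i in range(R.shape[1]):              # fix P, solve for q_i
        rated = R[:, i] > 0
        if rated.any():
            Pi = P[rated]
            A = Pi.T @ Pi + lam * np.eye(K)
            Q[i] = np.linalg.solve(A, Pi.T @ R[rated, i])
    return P, Q

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0], [4, 0, 1], [1, 1, 5], [0, 1, 4]], dtype=float)
P = rng.normal(scale=0.1, size=(4, 2))
Q = rng.normal(scale=0.1, size=(3, 2))
for _ in range(20):
    P, Q = als_step(R, P, Q, lam=0.1)
print(np.round(P @ Q.T, 2))                  # predicted ratings r̂_ui = q_i^T p_u
```

Because each $p_u$ (and each $q_i$) is solved independently of the others, the two loops parallelize naturally, which is the advantage the second paper highlights.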

Modifying MF Calculations

To increase the accuracy of the model, the second paper adds several modifications to it. One advantage of the matrix factorization method in collaborative filtering is that it can be adapted to various data aspects and application-specific requirements. From now on, all the information is based on the second paper's contribution to the topic.

Adding Biases

In our objective function, we consider the effect of user and item interactions to determine the rating value. However, much of the variation in ratings is due to either users or items alone, not their interactions. We call these effects biases or intercepts. In general, some users tend to give higher ratings than others, and some movies are considered better than others and receive higher ratings, independent of the interactions between items and users. An approximation of the bias involved in the rating $r_{ui}$ is:

$b_{ui} = \mu + b_i + b_u$

In this formula, $\mu$ is the overall average rating, and $b_u$ and $b_i$ are the deviations of user u and item i from the average. Let's see an example. Imagine we have a user about to rate the movie John Wick. The average rating over all movies is 3.5, but this movie is better than average, so we can predict that it will be rated 0.4 above the average. The user who is about to rate this movie is, on the contrary, a critical one, and tends to rate 0.3 below the average. As a result, the user's estimated rating for that movie would be $3.5 + 0.4 - 0.3 = 3.6$. The prediction of the model for the rating of user u on movie i is then reformulated as follows:

$\hat{r}_{ui} = \mu + b_i + b_u + q_i^T p_u$

Note that the biases are learned together with the factors rather than given to the model as fixed inputs. As you can see above, the observed rating comprises four components in the new formulation: global average, item bias, user bias, and user-item interaction. The objective function for the new model becomes:

$\min_{p^*, q^*, b^*} \sum_{(u,i) \in \kappa} \left( r_{ui} - \mu - b_u - b_i - p_u^T q_i \right)^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 + b_u^2 + b_i^2 \right)$

It is important to note that biases capture a large part of the observed signal, so modeling them accurately is of great importance. Some other works use more elaborate bias formulations.[7]
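As an illustration of how the biases are learned, the stochastic gradient updates from Approach 1 extend naturally to this biased model. The sketch below is an assumption-laden illustration (the step size, regularization constant, and function name are hypothetical; the factor of 2 from Approach 1 is absorbed into the step size):

```python
import numpy as np

def sgd_update_biased(r_ui, mu, b_u, b_i, p_u, q_i, eta=0.005, lam=0.02):
    """One stochastic gradient step for r̂_ui = mu + b_i + b_u + q_i·p_u.
    The biases b_u and b_i are parameters updated alongside the factors."""
    e = r_ui - (mu + b_i + b_u + q_i @ p_u)   # prediction error on this rating
    b_u = b_u + eta * (e - lam * b_u)         # user bias moves toward its residual
    b_i = b_i + eta * (e - lam * b_i)         # item bias likewise
    p_u, q_i = (p_u + eta * (e * q_i - lam * p_u),   # both factors updated
                q_i + eta * (e * p_u - lam * q_i))   # from their old values
    return b_u, b_i, p_u, q_i

# Hypothetical usage with the John Wick numbers from the text:
mu, b_u, b_i = 3.5, -0.3, 0.4
p_u, q_i = np.array([0.1, 0.2]), np.array([0.3, 0.1])
b_u, b_i, p_u, q_i = sgd_update_biased(4.0, mu, b_u, b_i, p_u, q_i)
```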

Additional Input Sources

This method is utilized when the system deals with the cold start problem: there are not enough ratings from the users to reach general conclusions about their taste. To solve this problem, we can use other sources of information. To gain insight into users' preferences, as the second paper says, recommender systems can utilize implicit feedback based on users' behavioral information on the internet, such as their purchase and browsing history. Under such circumstances, even if no explicit ratings are available, retailers can make some inferences about users' tastes.

Let's consider a case of Boolean implicit feedback, where N(u) stores the items toward which user u has shown an implicit preference. Consequently, a new set of item factors is needed: item i is associated with a factor vector $x_i$, and a user who showed a preference for the items in N(u) is characterized by the vector:

$\sum_{i \in N(u)} x_i$

To normalize this sum, we have:[8]

$|N(u)|^{-0.5} \sum_{i \in N(u)} x_i$

As another information source, we can consider known user attributes such as age group, gender, zip code, etc. If the Boolean set A(u) stores the attributes of user u, a factor vector $y_a$ corresponds to each attribute a to describe the user. The final prediction model is:

$\hat{r}_{ui} = \mu + b_i + b_u + q_i^T \left[ p_u + |N(u)|^{-0.5} \sum_{i \in N(u)} x_i + \sum_{a \in A(u)} y_a \right]$
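A small sketch of this final prediction model, with hypothetical dictionaries of learned factor vectors standing in for the trained parameters (all names and numbers below are illustrative assumptions):

```python
import numpy as np

def predict_extended(mu, b_u, b_i, p_u, q_i, x_factors, y_factors, N_u, A_u):
    """r̂_ui = mu + b_i + b_u + q_i^T [p_u + |N(u)|^-0.5 * sum x_j + sum y_a].
    x_factors maps item ids to implicit-feedback factors x_i;
    y_factors maps attribute names to attribute factors y_a."""
    implicit = (sum(x_factors[j] for j in N_u) / np.sqrt(len(N_u))
                if N_u else 0.0)
    attributes = sum(y_factors[a] for a in A_u) if A_u else 0.0
    return mu + b_i + b_u + q_i @ (p_u + implicit + attributes)

# Hypothetical usage: one user, two implicitly preferred items, one attribute.
x_factors = {"item7": np.array([0.2, 0.1]), "item9": np.array([0.0, 0.3])}
y_factors = {"age_18_24": np.array([0.1, -0.1])}
print(predict_extended(mu=3.5, b_u=-0.3, b_i=0.4,
                       p_u=np.array([0.1, 0.2]), q_i=np.array([0.3, 0.1]),
                       x_factors=x_factors, y_factors=y_factors,
                       N_u=["item7", "item9"], A_u=["age_18_24"]))
```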

Temporal Dynamics

So far, the models we have investigated were all static. Even after several modifications, we did not take into account that users' tastes may evolve and change over time; on the other hand, products' popularity changes as well, as new choices come into being. MF can beautifully capture these effects. To formulate them, we use biases that depend on time, $b_i(t)$ and $b_u(t)$. Moreover, the user factors $p_u(t)$ are a function of time here too; this is because, for example, a user who is a fan of action movies may tend toward dramas the next year. Notably, items are static in nature and their features do not change over time, so the item factors $q_i$ are not a function of time. We can therefore rewrite the prediction using dynamic parameters:

$\hat{r}_{ui}(t) = \mu + b_i(t) + b_u(t) + q_i^T p_u(t)$

Inputs with varying confidence levels

It is important to note that not all observations carry the same importance and effect. Consider a case where advertising boosts the rating of a movie in the short term, even though the rating falls back in the long run. Or, regarding users, there might be adversarial users who deliberately try to lower the ratings of some movies by giving them low scores. To address this, we can take into account other information such as the frequency of certain actions, for example, how much time a user spends watching a show. One-time events are not that trustworthy, but recurring ones can affect the results. So, for each observed rating $r_{ui}$, we define a confidence value $c_{ui}$ to capture these effects; it can be on different scales, with different factors affecting it, and the recommender system decides what value to assign to each $c_{ui}$. We can rewrite the objective function like this:

$\min_{p^*, q^*, b^*} \sum_{(u,i) \in \kappa} c_{ui} \left( r_{ui} - \mu - b_u - b_i - p_u^T q_i \right)^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 + b_u^2 + b_i^2 \right)$
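For instance, the confidence weight simply scales each term of the squared error. A hypothetical sketch of the weighted loss over the known ratings (the function name and example values are assumptions for illustration):

```python
def weighted_objective(observations, lam, params_sq_norm):
    """Confidence-weighted regularized squared error.
    observations: iterable of (r_ui, c_ui, r_hat_ui) triples for known ratings;
    params_sq_norm: sum of squared norms of all regularized parameters."""
    error = sum(c_ui * (r_ui - r_hat_ui) ** 2
                for r_ui, c_ui, r_hat_ui in observations)
    return error + lam * params_sq_norm

# Hypothetical usage: two ratings, the recurring one weighted more heavily.
obs = [(4.0, 1.0, 3.6), (2.0, 0.2, 3.1)]
print(weighted_objective(obs, lam=0.02, params_sq_norm=1.7))
```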

Implementing modified models on Netflix Dataset

[Figure from the second paper: RMSE of the matrix factorization variants discussed here on the Netflix dataset, plotted against the number of model parameters; the numbers on each curve denote the factor dimensionality.]

In this plot (from the second paper), we can see the root mean squared error (RMSE) of five different models discussed on this page, including two variants of adding temporal components to the method. The accuracy improves as the number of parameters (and the factor model's dimensionality, denoted by the numbers on the curves) increases. Furthermore, we can see that the modifications covered here (all attributed to the second paper) substantially decreased the model's error. Interestingly, Netflix's own system achieves RMSE = 0.9514 on this dataset, while the grand prize required an error of RMSE = 0.8563.

Conclusion

On this page, we started by introducing recommender systems and their different categories, and we discussed the Netflix competition dataset as well. After that, we dove into the explanation of Matrix Factorization. We then computed the factors using the baseline gradient descent method, which was covered in the first paper, and saw how the second paper utilizes an additional approach, ALS, to compute those factors. To increase the accuracy of MF in recommender systems, the second paper suggests several modifications to the model, which were described in detail. Finally, we discovered that thanks to the contribution of the second paper's authors, the plain model suggested by the first paper improved greatly in terms of accuracy.

One thing not covered in these papers is how time-consuming these methods are to train on the dataset. Using deep learning algorithms to train the models could potentially decrease the error and make the models more time-efficient.

Annotated Bibliography

  1. Takács, G., Pilászy, I., Németh, B., & Tikk, D. (2007). Major components of the gravity recommendation system. Acm Sigkdd Explorations Newsletter, 9(2), 80-83.
  2. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37.
  3. Fang, J., Zhang, X., Hu, Y., Xu, Y., Yang, M., & Liu, J. (2021, January 10). Probabilistic latent factor model for collaborative filtering with bayesian inference. 2020 25th International Conference on Pattern Recognition (ICPR). http://dx.doi.org/10.1109/icpr48806.2021.9412376
  4. Mnih, A., & Salakhutdinov, R. R. (2007). Probabilistic matrix factorization. Advances in neural information processing systems, 20.
  5. Bennett, J., & Lanning, S. (2007, August). The netflix prize. In Proceedings of KDD cup and workshop (Vol. 2007, p. 35).
  6. Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. T. (2000). Application of dimensionality reduction in recommender system-a case study.
  7. Koren, Y. (2009, June). Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 447-456).
  8. Koren, Y. (2008, August). Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 426-434).


Some rights reserved
Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0