Restricted Boltzmann Machines for Collaborative Filtering

In this page, we introduce two methods for collaborative filtering: Singular Vector Decomposition (SVD) ^[1] and Restricted Boltzmann Machines (RBM).^[2]

Principal Author: Nam Hee Gordon Kim

Abstract

As the number of users and items grow larger, recommender systems are in need of scalable algorithms for collaborative filtering, in which a user's preference over many items is estimated based on other similar users. Latent factor methods such as singular vector decomposition (SVD) have proven feasible in such large scale settings. An alternative method based on Restricted Boltzmann Machines (RBM) is explored by Salakhutdinov et al.^[2], suggesting the feasibility of unsupervised graphical models for collaborative filtering. SVD and RBM are similar in their approach to estimate latent factors underlying the observed ratings between users and items, but different in how the latent factors are modeled. A variant of RBM called conditional factored RBM has a performance comparable to SVD, while the sources of error between SVD and RBM are different^[2].

Builds on

The reader should be familiar with the Collaborative Filtering problem. Collaborative filtering can be framed as a special case of imputation with not-missing-at-random (NMAR) data, which is outlined in Treatment of Missing Data. Early approaches to collaborative filtering framed the problem as an instance of either supervised learning (e.g. K-Nearest Neighbors) or unsupervised learning (e.g. K-Means Clustering). In this page, the main approach to collaborative filtering is supervised learning via matrix factorization and latent factors. Collaborative filtering is a crucial technology in recommender systems. Restricted Boltzmann Machines are an instance of probabilistic graphical models.

Related Pages

TBA

Content

Motivation

With the evolution of the Web and large-scale service Web platforms like Netflix, Amazon, and YouTube, it has become critical for recommender systems to adapt to the ever-growing scale of operation. Recommender systems produce recommendations in one of the two ways: (1) based on similarities between items, called content-based filtering, or (2) based on similarities between users, called collaborative filtering. In this page, we solely focus on collaborative filtering. Sarwar et al. describe collaborative filtering as a "successful recommender system technology... which works by matching customer preferences to other customers in making recommendations."^[3] One of the earliest implementations of collaborative filtering is Tapestry^[4], which "relied on the opinions of people from a close-knit community, such as an office workgroup."^[3] A suitable recommender system must perform collaborative filtering efficiently and accurately, while accommodating for the large (and increasing) number of customers and items as well as the complex patterns in the relationship among them.

One such method is based on singular value decomposition (SVD). SVD is an instance of latent factor analysis, which produces low-dimensional representations of the original data and uses them to predict the affinity between a customer and an item. In this page, a particular SVD technique called Maximum-Margin Matrix Factorization^[1], as proposed by Srebro et al., is explored. Salakhutdinov et al., on the other hand, present an alternative method based on Restricted Boltzmann Machines (RBM), an instance of probabilistic graphical models.^[2] In this page, we will explore SVD and RBM as two instances of collaborative filtering methods, discuss the details of each method, and compare the experimental results.

Preliminaries

Notations

For simplicity's sake, we use the Netflix recommendation challenge scenario as a framework for discussing recommender systems and collaborative filtering. We will use the notations similar to those in Kevin Murphy's textbook:

$X$ is the $n\times d$ user-item matrix where $n$ is the number of users on Netflix and $d$ is the number of movies.
An entry $x_{ij}$ $x_{ij}$ (row $i$ $i$ , column $j$ $j$ ) corresponds to user $i$ $i$ 's rating for movie $j$ $j$ . For simplicity's sake, let $x_{ij}\in {0,1,2,3,4,5}\quad \forall i,j$ $x_{ij}\in {0,1,2,3,4,5}\quad \forall i,j$ , where $0$ $0$ represents missing rating.
- For example, if $x_{ij}=3$ , then user $i$ has given movie $j$ the rating of 3 out of 5.
- If $x_{ij}=0$ , the user has not rated the movie $j$ .
- It follows that the rows of $X$ encode each user's preference over all movies and the columns of $X$ encode each item's ratings received by all users

Problem Definitions

Collaborative filtering naturally formulates a supervised learning problem, where the missing ratings in $X$ are responses to be predicted from the available data. For an arbitrary user $i$ , we are interested in predicting the user's rating on an unrated, unseen movie $q$ . More formally, we define prediction and inference in the collaborative filtering context as follows:

Prediction: Given the observed ratings $X$ , predict $x_{iq}\in \{1,2,3,4,5\}$ , a rating for a user $i$ a new query movie $q\in \{1,2,\cdots ,d\}$ .

Inference: In probabilistic terms, the prediction task defined above is a consequence of computing the probability $p(x_{iq}=k\ |\ X_{obs}),\quad k={1,2,3,4,5}$ where $X_{obs}$ denotes the non-zero entries of $X$ .

Singular Value Decomposition

Singular value decomposition (SVD) is a matrix factorization method that produces singular vectors.

$M=UDV'$

While SVD has been mainly an unsupervised method and used for dimensionality reduction for recommender systems' data^[3], a supervised variant of SVD, proposed by Goldman et al.^[4], supply an objective function to produce a matrix whose low-rank factorization best estimates the original data. In this method, the singular value matrix $D$ is ignored. The description of the following SVD method, often called the low-rank model, is borrowed from Salakhutdinov et al.^[2]

Let $M=UV'$ be the low-rank approximation to the partially observed target matrix $X$ . Then the objective function $f$ is defined by:

$f(U,V)=\sum _{i=1}^{n}\sum _{j=1}^{d}I_{ij}(u_{i}v_{j}'-X_{ij})^{2}$

Where $I_{ij}$ is the indicator function which is 1 if user $i$ rated movie $j$ . Srebro et al.^[1] proposes regularization on the Frobeus norm of $U$ and $V$ .

$f(U,V)=\sum _{i=1}^{n}\sum _{j=1}^{d}I_{ij}(u_{i}v_{j}'-X_{ij})^{2}+\lambda \sum _{i=1}^{n}\sum _{j=1}^{d}I_{ij}(||u_{i}||_{F}^{2}+||v_{j}||_{F}^{2})$

Then the optimal $U$ and $V$ are such that the objective function is minimized, and in so doing they define the optimal $M$ , whose entry $m_{iq}$ corresponds to the prediction for $x_{iq}$ , which is missing in the original matrix $X$ .

Restricted Boltzmann Machines

Example RBM architecture as proposed by Salakhutdinov et al. Each user has his/her own graph structure, which depends on the ratings for which movies are missing. The parameters are tied across all graph structures, thereby generating one set of parameters as a result of training.

Restricted Boltzmann Machines (RBM) are multilayered undirected graphical models with binary latent factors. Salakhutdinov et al.^[2] frames the task of computing $p(x_{iq}=k\ |\ X_{obs})$ as inference on an underlying RBM with trained parameters. To formulate this RBM framework, the dataset is subdivided into rating matrices, where a user's ratings are one-hot encoded into a matrix $V$ such that $v_{j}^{k}=1$ if the user rates movie $j$ with rating $k$ . The RBM graph is defined as follows:

$V$ is a $5\times d$ matrix corresponding to one-hot encoded integer ratings of the user, as defined above.
$h$ is a $F\times 1$ vector of binary hidden variables. Let's treat the number of hidden variables $F$ as a hyperparameter.
$W$ is a $d\times F\times 5$ tensor encoding adjacency between ratings and hidden features. Its entry $W_{jc}^{k}$ corresponds to the edge potential between rating $k$ of the movie $j$ and the hidden feature $c$ .

Then the whole user-item matrix becomes a collection of $V$ s, each corresponding to each user's ratings. Because each user can have different missing values, a unique RBM graph is defined for each user. In each RBM graph, edges connect ratings and hidden features but do not appear between movies of missing ratings. Salakhutdinov et al.^[2] treats $W$ as a set of edge potentials that are tied across all such RBM graphs.

As alluded above, $V$ is a collection of binary variables $v_{j}^{k}$ for $j\in \{1,2,\cdots ,d\}$ and $k\in {1,2,3,4,5}$ , and each binary variable is connected to every hidden feature in $h$ . RBM characterizes the relationship between the ratings and hidden features using conditional probabilities $p(v_{j}^{k}=1\ |\ h)$ and $p(h_{c}=1\ |\ V)$ , which are defined as follows:

$p(v_{j}^{k}=1\ |\ h)={\frac {\exp(b_{j}^{k}+\sum _{c=1}^{F}h_{c}W_{jc}^{k})}{\sum _{k'=1}^{5}\exp(b_{j}^{k'}+\sum _{c=1}^{F}h_{c}W_{jc}^{k'})}}$
$p(h_{c}=1\ |\ V)=\sigma \left(b_{c}+\sum _{j=1}^{d}\sum _{k=1}^{5}v_{j}^{k}W_{jc}^{k}\right)$

Note that the above conditional probabilities are functions of $W$ , the edge potentials in the RBM. Having populated the parameters in $W$ , $p(v_{q}^{k}=1\ |\ V)$ can be computed in two steps:

Compute the distribution of each hidden feature in $h$ based on observed ratings $V$ and the edge potentials $W$ , i.e. $p(h_{c}=1\ |\ V)$ for each $c$ .
Compute $p(v_{q}^{k}=1\ |\ h)$ based on the edge potentials $W$ and the distribution of $p(h_{c}=1\ |\ V)$ .

Optimization

Now, we outline the optimization for RBM. The derivations and lower-level details of optimization is found in Salakhutdinov et al.^[2]. $W$ is optimized by the marginal likelihood of $V$ , i.e. $p(V)$ . The gradient $\nabla W_{ij}^{k}$ is computed using contrastive divergence, which is an approximation of the gradient based on Gibbs sampling^[5].

Conditional Factored RBM

Salakhutdinov et al.^[2] further introduces conditional factoring for RBM, an improvement upon the basic RBM for collaborative filtering. The vanilla RBM involves a huge number of parameters in $W$ and is therefore prone to serious overfitting^[2]. Salakhutdinov et al. addresses this problem by factorizing $W$ into lower-rank matrices $A$ and $B$ , such that

$W_{ij}^{k}=\sum _{c=1}^{C}A_{ic}^{k}B_{cj}$

Performance of the conditional factored RBM vs. SVD with C=40 factors. The y-axis displays RMSE (root mean squared error), and the x-axis shows the number of epochs, or passes through the entire training dataset.

where $C$ is the rank of the two matrices. In this technique, the parameters of $W$ are approximated by the parameters in $A$ and $B$ , whose number is determined by the choice of $C$ . $A$ and $B$ are optimized with gradients $\nabla A_{i_{c}}^{k}$ and $\nabla B_{cj}$ which are also computed using contrastive divergence^[2]. This method effectively reduces the number of free parameters and also makes the gradient methods converge faster than the regular RBM.

Results

Salakhutdinov et al.^[2] includes the results on the 2005 Netflix dataset. Both SVD and RBM (i.e. conditional factored RBM) are trained on approximately 100 million movie ratings and tested on about 3 million ratings. The conditional factored RBM slightly outperforms SVD, although the patterns of prediction error between the two methods are significantly different.

Conclusion

In this page, we examined the SVD approach and the RBM approach to collaborative filtering. The contribution of Srebro et al.^[1] is the Frobenius norm regularization on the factorization of the user-item matrix $X$ , while the contribution of Salakhutdinov et al.^[2] is the introduction of an alternative, undirected graphical model approach to collaborative filtering. The juxtaposition between the two are summarized in the following bullet points:

Both SVD and RBM rely on hidden features and low-rank approximations of sparsely available data. However, the manner of constructing the hidden features is different between the two methods.
SVD aims to find the optimal factorization of $X$ based on available observations. RBM aims to learn the edge potentials of the undirected graph. SVD relies on the resolution of the hidden features in reconstructing the available data using dot products, while RBM has primarily probabilistic interpretation as to how the hidden features are constructed.
Conditional factored RBM performs slightly better than SVD on the Netflix dataset.

My Thoughts About Contributions

Given that SVD and RBM have similar performances, I think Salakhutdinov et al.^[2]'s real contribution is the novelty in its application rather than its performance. Salakhutdinov et al.^[2] furthermore states that the sources of error in RBM are different from those of SVD, meaning some of the weaknesses pertaining to SVD are addressed by RBM, but RBM also loses some of SVD's advantages. Having larger amount of data will hopefully highlight the discrepancies between the two even more.

Annotated Bibliography

↑ ^1.0 ^1.1 ^1.2 ^1.3 Srebro, Nathan, Jason Rennie, and Tommi S. Jaakkola. "Maximum-margin matrix factorization." Advances in neural information processing systems. 2005.
↑ ^2.00 ^2.01 ^2.02 ^2.03 ^2.04 ^2.05 ^2.06 ^2.07 ^2.08 ^2.09 ^2.10 ^2.11 ^2.12 ^2.13 ^2.14 Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for collaborative filtering." Proceedings of the 24th international conference on Machine learning. ACM, 2007.
↑ ^3.0 ^3.1 ^3.2 Sarwar, Badrul, et al. Application of dimensionality reduction in recommender system-a case study. No. TR-00-043. Minnesota Univ Minneapolis Dept of Computer Science, 2000.
↑ ^4.0 ^4.1 Goldberg, David, et al. "Using collaborative filtering to weave an information tapestry." Communications of the ACM 35.12 (1992): 61-71.
↑ Hinton, Geoffrey E. "Training products of experts by minimizing contrastive divergence." Neural computation 14.8 (2002): 1771-1800.

To Add

Put links and content here to be added. This does not need to be organized, and will not be graded as part of the page. If you find something that might be useful for a page, feel free to put it here.

Describe some research that has been published. You should choose two papers, by different authors (no authors in common) where one builds on the other. You should describe the background, and then describe the incremental contribution of one paper over the other. What was the actual contribution of the latter paper? The reader should be able to understand the problem, where it fits into the big picture, the solution proposed and how that solution was evaluated. Add your own thoughts on how successful it was and how it can be improved. See http://www.cs.ubc.ca/~poole/cs522/2019/readings.html for some suggested topics.

[:2-1] 1.0 ^1.1 ^1.2 ^1.3 Srebro, Nathan, Jason Rennie, and Tommi S. Jaakkola. "Maximum-margin matrix factorization." Advances in neural information processing systems. 2005.

[:1-2] 2.00 ^2.01 ^2.02 ^2.03 ^2.04 ^2.05 ^2.06 ^2.07 ^2.08 ^2.09 ^2.10 ^2.11 ^2.12 ^2.13 ^2.14 Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for collaborative filtering." Proceedings of the 24th international conference on Machine learning. ACM, 2007.

[:0-3] 3.0 ^3.1 ^3.2 Sarwar, Badrul, et al. Application of dimensionality reduction in recommender system-a case study. No. TR-00-043. Minnesota Univ Minneapolis Dept of Computer Science, 2000.

[:3-4] 4.0 ^4.1 Goldberg, David, et al. "Using collaborative filtering to weave an information tapestry." Communications of the ACM 35.12 (1992): 61-71.

[5] Hinton, Geoffrey E. "Training products of experts by minimizing contrastive divergence." Neural computation 14.8 (2002): 1771-1800.

[1]

[2]

[3]

[4]

[5]