Course:CPSC522/Multinomial Variational Autoencoders for Predicting Gender in MovieLens

Title

Using multinomial variational autoencoders to predict the gender of users in the MovieLens dataset.

Principal Author: Jeffrey Niu
Collaborators:

Abstract

Recommender systems are prevalent in many online services, from streaming services such as Netflix, to online retailers such as Amazon. These recommender systems take in a user's interactions with the available items to suggest new items for them to watch or buy. In this article, we consider the problem of predicting the user's gender from their interactions. Our main hypothesis states that a multinomial variational autoencoder (MultiVAE) can be used to better predict gender compared to the traditional approach of matrix factorization. Then, we explore two related hypotheses concerning the method used to preprocess the interaction matrix and the manner in which the MultiVAE is trained. To test our hypotheses, we use the MovieLens dataset, which contains user ratings of movies.

Builds on

Multinomial VAEs are a form of variational autoencoder, which uses variational inference. The problem presented in this article is an extension of recommendation systems which have been tackled with matrix factorization and personal agents. The task of predicting user gender from movie ratings comes from an exploration by Kazemi et al.

Related Pages

Recommender systems are further explored in many other articles. This article on improving recommendation system by integration explores a method for integrating multiple datasets together for recommendation systems. This article on evaluating, selecting, and applying recommendation methods compares a method in collaborative filtering to other systems.

Content

Background

Recommender systems provide users with a list of recommendations of unseen items given a list of items that they have already interacted with. On Netflix and other streaming services, the system would have access to the previous content that the user watched as well as their ratings. From this information, the system will provide the user with content they have never seen. Likewise, for Amazon and other e-commerce platforms, the system would take a user's purchases and their ratings to suggest new products for the user to purchase. These systems are critical to increasing a service's user engagement.

To model the recommendation problem, consider a service with $N$ users and $M$ items. Each of these users will have rated one or more items. This can be represented as a sparse matrix, where the $(i,j)$-th entry holds the rating that user $i$ gave to item $j$ and is empty if the user has not rated that item. A recommender system will take this matrix and output suggestions for each user.

Predicting Gender from Movie Reviews

In this article, we will attempt to predict a user's gender from the movies they have rated. This task is a form of aggregation, where a single variable (gender) depends on many other variables (movie ratings). As Kazemi et al. motivate, existing relational probabilistic models fail to perform aggregation without losing too much information. Thus, we must look to new methods to perform aggregation.

MovieLens Dataset

The dataset used for this article is the MovieLens dataset. MovieLens is an online platform where users can rate and explore movies. Over the years, they have released progressively larger datasets with more users, movies, and ratings. For this article, we will use the MovieLens 100k and 1m datasets, which contain 100,000 and 1,000,000 ratings, respectively. The dataset also contains the gender and occupations for each user.

The dataset is provided as a set of tuples of the form $(u,i,r,t)$ , where $u$ is the user's ID, $i$ is the ID of the movie the user rated, $r$ is their rating on a scale of 1-5, and $t$ is the timestamp of the rating. To process this data, we convert the tuples into matrix form, where each $(u,i)$ entry will contain the rating of the user. Then, we binarize the matrix, following the method of Liang et al., where the matrix entry is 1 if the user gave a rating of 4 or higher. Otherwise, the entry is set to 0. The rationale for this decision is that we want to capture what the user likes, not dislikes.
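The conversion from tuples to a binarized matrix can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function name `binarize_ratings`, the 0-based IDs, the dense list-of-lists matrix, and the toy data are all assumptions made for the example.

```python
def binarize_ratings(ratings, num_users, num_items, threshold=4):
    """Build a 0/1 user-by-item matrix from (u, i, r, t) tuples.

    An entry is 1 when the user rated the item at or above `threshold`
    and 0 otherwise, including items the user never rated.
    IDs are assumed to be 0-based, which is an illustrative choice.
    """
    matrix = [[0] * num_items for _ in range(num_users)]
    for user, item, rating, _timestamp in ratings:
        if rating >= threshold:
            matrix[user][item] = 1
    return matrix


# Hypothetical toy data: 2 users, 3 movies.
toy_ratings = [
    (0, 0, 5, 881250949),
    (0, 1, 3, 881250950),
    (1, 1, 4, 881250951),
    (1, 2, 1, 881250952),
]
binarized = binarize_ratings(toy_ratings, num_users=2, num_items=3)
print(binarized)  # [[1, 0, 0], [0, 1, 0]]
```

Note that a low rating and a missing rating both map to 0 under this scheme, so the two cases become indistinguishable after binarization.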

Challenges

The major challenge in predicting a user's gender is the sparsity within the data. Almost all the users rate only a few movies, meaning we only have a few data points to work with. The most difficult users to predict are the ones with only one or two ratings. Another challenge with this data is the gender imbalance: only 28% of the users are female.

Collaborative Filtering

Collaborative filtering is one of the most popular approaches in recommender systems. This approach discovers similarities between users and items to predict the user's preferences. For example, say Alice and Bob both highly rate Toy Story, Finding Dory, and Frozen. Bob also highly rated Inside Out. Since Alice shares many of Bob's preferences, the recommender system will recommend Inside Out to Alice. Likewise, if Bob disliked Saw, then the recommender system will avoid putting Saw in Alice's recommended list. We will extend existing collaborative filtering methods for predicting gender.

Matrix Factorization for Collaborative Filtering

Matrix factorization is a popular method in collaborative filtering, being the basis for the method that won the Netflix Prize competition. In matrix factorization, we model the $N\times M$ rating matrix using a latent space of dimension $K$ . This factorization splits the matrix into two matrices of dimension $N\times K$ and $K\times M$ . The latent space finds commonalities between the items. In the Netflix example, one dimension of the latent space may represent the scariness of a movie, while another may represent the family-friendliness of a movie. A movie like Saw will have a high value in the scariness dimension, while a movie like Frozen will have a high value in the family-friendliness dimension.

Each user $i$ is represented by a $K$ -dimensional vector, $p_{i}$ , indicating their preference for each of the factors in the latent space. Each item $j$ is represented by a $K$ -dimensional vector, $q_{j}$ , indicating how much it possesses each of the factors. Thus, to estimate what user $i$ would have rated item $j$ , we compute $p_{i}^{T}q_{j}$ . The items we recommend are simply the ones with the highest estimated rating. The main advantage of matrix factorization is that it handles the sparsity of the matrix well, as the matrix factorization algorithm does not consider missing entries. For a full explanation of the matrix factorization procedure for recommender systems, see this article.
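As a small illustration of the scoring step, the estimate $p_{i}^{T}q_{j}$ and the resulting recommendation can be sketched as below. The factor values, the two named latent dimensions, and the helper names `predict_rating` and `recommend` are hypothetical; learning the factors themselves is not shown.

```python
def predict_rating(p_i, q_j):
    """Estimated rating: inner product of user and item factor vectors."""
    return sum(p * q for p, q in zip(p_i, q_j))


def recommend(p_i, item_factors, already_rated, top_n=1):
    """Return indices of the top_n unrated items by estimated rating."""
    scores = [
        (predict_rating(p_i, q_j), j)
        for j, q_j in enumerate(item_factors)
        if j not in already_rated
    ]
    scores.sort(reverse=True)
    return [j for _score, j in scores[:top_n]]


# Hypothetical factors with K = 2: (scariness, family-friendliness).
item_factors = [
    [0.9, 0.1],  # item 0: a horror movie
    [0.1, 0.9],  # item 1: a family movie
    [0.2, 0.8],  # item 2: another family movie
]
p_user = [0.0, 1.0]  # a user who only cares about family-friendliness

print(recommend(p_user, item_factors, already_rated={1}))  # [2]
```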

We can use this matrix factorization framework to predict the user's gender in MovieLens by performing logistic regression on each user's latent representation, $p_{i}$.

Multinomial Variational Autoencoder

The multinomial variational autoencoder (MultiVAE) advances on the standard variational autoencoder (VAE) architecture in two ways:

The data distribution is represented by a multinomial likelihood rather than the Gaussian likelihood used in most VAE architectures.
The VAE objective is reformulated to reduce the effect of regularization.

Multinomial Likelihood

The multinomial distribution is the generalization of the binomial distribution to multiple distinct events. The parameters for the multinomial distribution are given by ( $n$ , $p_{1},...,p_{k}$ ), where $n$ is the number of trials, and $p_{1},...,p_{k}$ are the probabilities of each event, $\sum _{i}p_{i}=1$ .

Next, we define the data generation procedure for the $i$-th user's binarized ratings. Let $\mathbf {x} _{i}$ be the observed binarized ratings for user $i$. We start by sampling a $K$-dimensional latent representation $\mathbf {z} _{i}$ from a standard Gaussian prior. We then pass $\mathbf {z} _{i}$ through a non-linear function $f_{\theta }(\cdot )$, implemented as a neural network with parameters $\theta$, to produce a score for each of the $M$ movies. We normalize these scores using the softmax function to obtain $\pi (\mathbf {z} _{i})$, a set of probabilities over the movies that sum to one. Given $N_{i}=\sum _{j}x_{ij}$, the total number of ratings from a user, we assume that the observed ratings are sampled from a multinomial distribution with parameters $(N_{i},\pi (\mathbf {z} _{i}))$. Finally, we can calculate the log-likelihood for user $i$ as:

\log p_{\theta }(\mathbf {x} _{i}\mid \mathbf {z} _{i})=\sum _{j}x_{ij}\log \pi _{j}(\mathbf {z} _{i})

The rationale for using a multinomial likelihood is that it better models the rating matrix. Since there is a limited probability mass for each user, given by $N_{i}$, the model must budget accordingly to assign higher probability mass to movies that the user would highly rate.
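The multinomial log-likelihood $\sum _{j}x_{ij}\log \pi _{j}(\mathbf {z} _{i})$ can be sketched in a few lines. Here the list `scores` stands in for the unnormalized output of the decoder network, which is not implemented; the values and the function names are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return [s - log_norm for s in scores]


def multinomial_log_likelihood(x_i, scores):
    """Compute sum_j x_ij * log pi_j, where pi = softmax(scores)."""
    log_pi = log_softmax(scores)
    return sum(x * lp for x, lp in zip(x_i, log_pi))


# Hypothetical decoder scores for M = 3 movies and one user's binarized row.
x_i = [1, 0, 1]
scores = [2.0, 0.0, 2.0]
ll = multinomial_log_likelihood(x_i, scores)
print(ll)
```

Because the softmax probabilities sum to one, raising the score of one movie necessarily lowers the probability of the others, which is the budgeting effect described above.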

Reformulated VAE Objective

The standard VAE objective ${\mathcal {L}}$ consists of two components: a negative reconstruction error and a Kullback-Leibler (KL) divergence that can be viewed as a regularization term:

{\begin{aligned}\log p(\mathbf {x} _{i};\theta )&\geq \mathbb {E} _{q_{\phi }(\mathbf {z} _{i}\mid \mathbf {x} _{i})}[\log p_{\theta }(\mathbf {x} _{i}\mid \mathbf {z} _{i})]-KL(q_{\phi }(\mathbf {z} _{i}\mid \mathbf {x} _{i})||p(\mathbf {z} _{i}))\\&\equiv {\mathcal {L}}(\mathbf {x} _{i};\theta ,\phi )\end{aligned}}

The authors of MultiVAE believe that the standard VAE objective is over-regularized. The KL divergence term is useful for applications where ancestral sampling is important. Ancestral sampling refers to the process of generating unseen observations using the VAE. Since the purpose of MultiVAE is to discover the latent factors underlying user ratings, not to generate imaginary user ratings, MultiVAE can put less emphasis on the KL divergence term. Thus, they reformulate the objective in MultiVAE as:

{\mathcal {L}}_{\beta }(\mathbf {x} _{i};\theta ,\phi )\equiv \mathbb {E} _{q_{\phi }(\mathbf {z} _{i}\mid \mathbf {x} _{i})}[\log p_{\theta }(\mathbf {x} _{i}\mid \mathbf {z} _{i})]-\beta \cdot KL(q_{\phi }(\mathbf {z} _{i}\mid \mathbf {x} _{i})||p(\mathbf {z} _{i}))

where $\beta$ controls the strength of the regularization. To determine the value of $\beta$, the model is trained with $\beta$ starting at $0$. Then, $\beta$ is slowly increased over many training iterations, annealing the KL divergence term, until it reaches $1$. We can then see which value $\beta ^{*}$ yielded the best results. Finally, we can retrain the model, capping the increase of $\beta$ at $\beta ^{*}$.
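The annealing procedure can be sketched as follows. Two assumptions are made that the text does not specify: the schedule increases $\beta$ linearly with the training step, and the approximate posterior is a diagonal Gaussian, for which the KL divergence to a standard Gaussian prior has a closed form. The function names and numbers are hypothetical.

```python
import math


def beta_schedule(step, anneal_steps, beta_cap=1.0):
    """Increase beta linearly from 0, then hold it at beta_cap (assumed schedule)."""
    return min(beta_cap, step / anneal_steps)


def gaussian_kl(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def annealed_objective(recon_log_lik, mu, logvar, beta):
    """Reconstruction log-likelihood minus the beta-weighted KL term."""
    return recon_log_lik - beta * gaussian_kl(mu, logvar)


# Hypothetical run: anneal over 4 steps, capped at beta* = 0.5.
betas = [beta_schedule(step, anneal_steps=4, beta_cap=0.5) for step in range(6)]
print(betas)  # [0.0, 0.25, 0.5, 0.5, 0.5, 0.5]
```

Setting `beta_cap=1.0` corresponds to the first training run, in which $\beta$ is annealed all the way to $1$; the second run reuses the same schedule with the cap set to the best value found.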

Using MultiVAE to Predict Gender

The procedure to predict gender using MultiVAE is similar to matrix factorization. Once again, we perform logistic regression over the latent representation generated by the encoder network for all the users. Specifically, given training and testing splits of the MovieLens dataset, we can compute the latent representations on the full dataset, as the latent representations are independent of gender. Then, we train the logistic regression weights using the training data and evaluate on the test data.
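The logistic regression step can be sketched as below. This is a plain gradient-descent stand-in written for illustration, not the implementation used in the experiments; the latent vectors are made-up 2-dimensional values rather than encoder outputs, and the learning rate and epoch count are arbitrary.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def fit_logistic_regression(latents, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    dim = len(latents[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(latents)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for z, y in zip(latents, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, z)) + bias) - y
            for k in range(dim):
                grad_w[k] += error * z[k] / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict_proba(latents, weights, bias):
    """Probability that each user is female (label 1)."""
    return [sigmoid(sum(w * v for w, v in zip(weights, z)) + bias) for z in latents]


# Hypothetical latents; label 1 = female, 0 = male.
train_z = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.3], [-2.2, 0.0]]
train_y = [1, 1, 0, 0]
test_z = [[1.0, 0.0], [-1.0, 0.0]]

weights, bias = fit_logistic_regression(train_z, train_y)
test_probs = predict_proba(test_z, weights, bias)
print(test_probs)
```

Only the logistic regression weights depend on the gender labels, which is why the labels of the test users are never needed to compute their latent representations.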

Does MultiVAE Better Predict Gender?

Our first hypothesis is that using the latent space learned by MultiVAE will better predict the gender of the MovieLens users based on their ratings compared to matrix factorization. As MultiVAE was shown to perform better on recommending users movies, we anticipate that its latent representations are more expressive than the representations produced by matrix factorization. Hence, performing logistic regression on MultiVAE latents will help better predict gender.

To test the first hypothesis, we will evaluate the gender predictions using three metrics: mean squared error (MSE), log loss (LL), and accuracy of gender predictions. MSE and LL both use the outputs from logistic regression that represent the probability a user is female, while the accuracy rounds the outputs to 0 or 1 and calculates the percentage of correctly predicted user genders. Compared to MSE, log loss penalizes confident but incorrect predictions more heavily. A better performing algorithm will have lower MSE and LL, and higher accuracy.
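The three metrics can be sketched as follows. The labels and probabilities are hypothetical, and treating a probability of exactly 0.5 as a female prediction is an arbitrary tie-breaking choice made for the example.

```python
import math


def mean_squared_error(y_true, y_prob):
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob):
    """Mean negative log-likelihood; infinite if a certain prediction is wrong."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        prob_of_truth = p if y == 1 else 1.0 - p
        if prob_of_truth == 0.0:
            return math.inf
        total -= math.log(prob_of_truth)
    return total / len(y_true)


def accuracy(y_true, y_prob):
    """Fraction of users whose thresholded prediction matches the label."""
    return sum((p >= 0.5) == (y == 1) for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical labels (1 = female) and predicted probabilities of female.
y_true = [1, 0, 0, 0]
y_prob = [0.8, 0.1, 0.4, 0.6]
always_male = [0.0, 0.0, 0.0, 0.0]

print(mean_squared_error(y_true, y_prob))  # approximately 0.1425
print(accuracy(y_true, y_prob))            # 0.75
print(log_loss(y_true, always_male))       # inf
```

A predictor that outputs a probability of exactly 0 for every user has an MSE equal to the fraction of female users and an infinite log loss as soon as one user is female, since it assigns zero probability to an outcome that occurs.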

Experiments

To test the hypothesis, we ran the matrix factorization algorithm and the MultiVAE on both the MovieLens-100k and MovieLens-1m datasets. In all experiments, we use a 200-dimensional latent space. We randomly divided the dataset into 70% training and 30% testing splits for the purpose of predicting gender. We also provide a baseline of always predicting male (the majority class). The results are shown below:

Performance on MovieLens-100k
	Mean Squared Error	Log Loss	Gender Prediction Accuracy
Predict Male Baseline	0.2826	$\infty$	71.73%
Matrix Factorization	0.1907	0.8188	72.08%
MultiVAE	0.1608	0.5018	76.68%

Performance on MovieLens-1m
	Mean Squared Error	Log Loss	Gender Prediction Accuracy
Predict Male Baseline	0.2809	$\infty$	71.70%
Matrix Factorization	0.1835	0.7888	72.84%
MultiVAE	0.1299	0.4100	81.07%

Does Changing the Binarization Method Improve Predictions?

Previously, we stated that the ratings matrix is binarized such that a 1 indicates that the user gave the movie a rating of 4 or higher, which follows the method of Liang et al. They chose this formulation because their goal was to find movies that the user would enjoy. Thus, they binarize the movies into movies the user liked and movies the user did not like. However, since our task has moved away from making recommendations to aggregating the ratings for predicting gender, a new binarization strategy could yield better results.

We now propose a second binarization method. Here, a 1 indicates that the user rated the movie. We ignore whether the user thought the movie was good or bad, only caring about whether the user rated it. The reasoning for this method is that the user must decide on a movie to watch, and this decision may already be influenced by their gender. Even though they might not have enjoyed the movie, it still represents a choice to give the movie a try. For example, suppose males are disproportionately drawn towards action movies. If a user gave the action movie The Last Days of American Crime a rating of 1, we should not discard this evidence. Under this supposition, the fact that the user chose to watch the movie, despite disliking it, is evidence that the user is male.
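The difference between the two binarization methods can be illustrated on a single user's row. The helper names and the ratings are hypothetical, and `None` is used here to mark a movie the user never rated.

```python
def binarize_liked(rating, threshold=4):
    """First method: 1 only if the rating is at or above the threshold."""
    return 1 if rating is not None and rating >= threshold else 0


def binarize_rated(rating):
    """Second method: 1 if the user rated the movie at all."""
    return 1 if rating is not None else 0


# Hypothetical user row; None marks a movie the user never rated.
user_ratings = [5, 1, None, 3]
liked = [binarize_liked(r) for r in user_ratings]
rated = [binarize_rated(r) for r in user_ratings]
print(liked)  # [1, 0, 0, 0]
print(rated)  # [1, 1, 0, 1]
```

Under the first method, the movie rated 1 is indistinguishable from the unrated movie; under the second, it is indistinguishable from the movie rated 5.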

Experiments

To test the second hypothesis, we run the same experiments as the previous hypothesis. The results are shown below:

Performance on MovieLens-100k with New Binarization
	Mean Squared Error	Log Loss	Gender Prediction Accuracy
Predict Male Baseline	0.2826	$\infty$	71.73%
Matrix Factorization	0.1896	0.8110	71.73%
MultiVAE	0.1517	0.4696	78.80%

Performance on MovieLens-1m with New Binarization
	Mean Squared Error	Log Loss	Gender Prediction Accuracy
Predict Male Baseline	0.2809	$\infty$	71.70%
Matrix Factorization	0.2047	0.8668	71.80%
MultiVAE	0.1271	0.4110	82.94%

Does Training MultiVAE on Only Training Data Worsen Predictions?

In the first hypothesis, we trained MultiVAE on the entire dataset, as the gender is predicted from the latent representations produced by MultiVAE. However, if we are given a brand new batch of users, the fastest way to predict their genders would be to encode their ratings into the latent space and predict their gender using the learned logistic regression weights. This process is not represented in the experiment for the first hypothesis. We hypothesize here that training MultiVAE on only the 70% training data will not affect the performance.

Experiments

Specifically, we trained the MultiVAE model on the 70% training data and learned the logistic regression weights using the latent representations of the training data. We then pass the test data through the encoder to obtain latent representations and feed these into the logistic regressor, which outputs the probability that the user is female. We compare the MultiVAE performance from the first hypothesis with this training method.

Performance on MovieLens-100k
	Mean Squared Error	Log Loss	Gender Prediction Accuracy
MultiVAE - Train on all data	0.1608	0.5018	76.68%
MultiVAE - Train only 70% data	0.1854	0.5712	71.75%

Performance on MovieLens-1m
	Mean Squared Error	Log Loss	Gender Prediction Accuracy
MultiVAE - Train on all data	0.1299	0.4100	81.07%
MultiVAE - Train only 70% data	0.1429	0.4447	78.81%

Conclusion

The experiments support the first hypothesis. In MovieLens-100k, the difference in MSE and gender prediction accuracy is small, but the difference in log loss is greater. These numbers suggest that MultiVAE makes fewer confident but incorrect predictions. We see a larger difference in the evaluation metrics in the MovieLens-1m dataset, where all three metrics improve substantially. These results suggest that when given access to more data, MultiVAE can better capture the differences between genders in the latent representation compared to matrix factorization.

The experiments do not show conclusive evidence that the new binarization method (second hypothesis) improves performance. For matrix factorization, performance is roughly unchanged on MovieLens-100k and slightly worse on MovieLens-1m, while for MultiVAE, performance improves slightly on most metrics. We believe this result can be explained by better performance on users with fewer ratings and poorer performance on users with more ratings. For users with fewer ratings, the new binarization method allows the model to exploit the choices the user made in selecting movies to watch. However, users with many ratings may watch movies that are atypical for their gender and rate them poorly. The new binarization method loses this information, as any rating is treated the same.

Finally, the experiments suggest that the third hypothesis is false. There is a moderate dropoff in all three metrics when switching to training MultiVAE on only the training data. The dropoff in the MovieLens-100k dataset is likely due to the even smaller number of users available for training. In the MovieLens-1m dataset, there is still a dropoff, though it is smaller. Moreover, the performance is still better than matrix factorization on MSE and log loss for both datasets, and on accuracy for MovieLens-1m, which suggests that even without training MultiVAE on the full dataset, MultiVAE is generally a better predictor than matrix factorization.

Annotated Bibliography

Harper, F. M., & Konstan, J. A. (2015). The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TIIS), 5(4), 1-19.

Kazemi, S. M., Fatemi, B., Kim, A., Peng, Z., Tora, M. R., Zeng, X., ... & Poole, D. (2017). Comparing aggregators for relational probabilistic models. arXiv preprint arXiv:1707.07785.

Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37.

Liang, D., Krishnan, R. G., Hoffman, M. D., & Jebara, T. (2018). Variational Autoencoders for Collaborative Filtering (arXiv:1802.05814). arXiv. https://doi.org/10.48550/arXiv.1802.05814

Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0