Course:CPSC532:StaRAI:2017:As1:Q2

From UBC Wiki

Predicting Gender from Movie Ratings

Here we want a brief title of the method and its accuracy on the test set. You may train only on the given data (the ratings before timestamp 884673930 and the genders of the people who rated before 880845177), and report the accuracy on the test set.

Method Sum-of-squares Error Log loss
Predict 0.5 42.75 171
Training average 36.92 153.96
Naive 84 27904.20
NaiveWithGenderProbability(Moumita) 44.11 193.99
Average of movies, same rating + closest ratings (AK) 36.23 151.34
Average of movies rated v1 36.30 151.64
Average of movies rated v2 36.09 150.91
Deep Relational Learning v1 33.81 142.68
Average of movies with rating v4 34.5 145.7
KNN/Cosine/3/ratingfeature 47 15613
KNN/Cosine/5/ratingfeature 46 15280
SVM classifier, probabilities by Platt scaling (AK) 38.11 161.08
NaiveHiddenMean, std=1 43.00 341.36
NaiveHiddenMean, std=2 35.71 161.25
Naive Bayes with uniform prior 37.9 400.4
Naive Bayes with hierarchical prior 41.4 347.5
Naive Bayes with limited children 33.25 141
Neural network with PCA (BF) 55.62 273.88
L2-logistic regression with PCA (BF) 41.11 177.34
L2/L1LogisticRegwithTopKRatings(Moumita) 35.19 147.14
LogisticReg as MLN in Alchemy, disc. weight learning (AK) 39.17 212.53
LogisticReg as MLN in Alchemy, gen. weight learning (AK) 51.79 361.83
Matrix Factorization and Support Vector Machine 39.99 174.96
Matrix Factorization and Logistic Regression (BF) 34.48 144.28
BPMF and Logistic Regression 32.52 139.15
Semi Supervised BPMF 36.01 150.79
Deep Relational Learning v2 32.1660 138.2096
RDN-Boost (BF) 36.85 153.70
LogReg + Limited parents + early stopping 31.42 136.50
Anglican - with hidden var. movies clustered. isRated: rating>=0 (Matt) 38.81 162.84
Anglican - with hidden var. movies clustered. isRated: rating>=4 (Matt) 55.60 215.38

Method Descriptions

For each method, please provide a description of the method (as may appear in a research paper), and a link to the code that produces these results. Make sure your code specifies its copyright.

Naive

The Naive model is a very simple model, which can also be seen as a graphical model.

Let i be the index for a person and j be the index for a movie.

Let g_i be the gender of person i, and let r_ij be the rating of person i on movie j.

Assume r_ij ~ N(μ_{g_i}, σ²), i.e., each rating is normally distributed around a mean that depends only on the rater's gender.

We can learn μ_F and μ_M easily: the MLE of the mean for i.i.d. normal observations is exactly their average, so each gender's mean is the average of the ratings made by people of that gender.

Then for the test set we maximize P(ratings | gender), which is the same as maximizing P(gender | ratings) because our prior over gender is uniform (naive).

A possible slight future modification is to make the prior the empirical probability of being female, ave_F. This requires extra coding to compute the exact value of P(gender | ratings). But given the poor performance of the Naive model, this modification might not be helpful.

Link to Code
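The Naive model above can be sketched as follows; the toy ratings, genders, and σ = 1 are hypothetical stand-ins for the assignment data:

```python
import math

# Toy data (hypothetical): ratings keyed by (person, movie), and the
# known genders of the training people.
ratings = {("u1", "m1"): 5, ("u1", "m2"): 4,
           ("u2", "m1"): 2, ("u2", "m2"): 1,
           ("u3", "m1"): 5, ("u3", "m2"): 4}
gender = {"u1": "F", "u2": "M"}  # u3 is a test person

# MLE: each gender's shared mean is the average of that gender's ratings.
def gender_mean(g):
    vals = [r for (u, _), r in ratings.items() if gender.get(u) == g]
    return sum(vals) / len(vals)

mu = {g: gender_mean(g) for g in ("F", "M")}

def prob_female(person, sigma=1.0):
    # Gaussian log-likelihood of the person's ratings under each
    # gender mean, combined with a uniform (naive) gender prior.
    def loglik(g):
        return sum(-(r - mu[g]) ** 2 / (2 * sigma ** 2)
                   for (u, _), r in ratings.items() if u == person)
    return 1.0 / (1.0 + math.exp(loglik("M") - loglik("F")))
```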

NaiveHiddenMean

Same as Naive, except now the mean is not shared across movies: each movie j has its own per-gender mean μ_{j,g}. We also place a normal prior on each of these means.

This is alternatively equivalent to a naive Bayes model.

We again learn the means by maximum likelihood (MAP, given the prior), and set the prediction to P(gender | ratings), which is proportional to P(ratings | gender) P(gender).

The standard deviation we have here is a hyperparameter. It has a large effect on performance, but so far we have no good way of setting it.
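A sketch of the per-movie, per-gender MAP mean under a normal prior, as this method describes; the toy data and the prior/noise parameters (σ, τ) are hypothetical:

```python
# Toy data (hypothetical): (person, movie, rating) triples plus genders.
ratings = [("u1", "m1", 5), ("u1", "m2", 4),
           ("u2", "m1", 2), ("u3", "m1", 1)]
gender = {"u1": "F", "u2": "M", "u3": "M"}

mu0 = sum(r for _, _, r in ratings) / len(ratings)  # prior mean

def map_mean(movie, g, sigma=1.0, tau=1.0):
    # Normal-normal conjugacy: the posterior mean is a precision-weighted
    # average of the prior mean mu0 and the observed ratings.
    vals = [r for u, m, r in ratings if m == movie and gender[u] == g]
    n = len(vals)
    if n == 0:
        return mu0
    return (mu0 / tau**2 + sum(vals) / sigma**2) / (1 / tau**2 + n / sigma**2)
```

With more ratings, the estimate moves from the prior mean toward the sample mean, which is why the standard deviation hyperparameters matter so much.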

NaiveWithGenderProbability (Moumita)

The probability of the gender being male or female is multiplied by the average rating of males or females, respectively.

Average of movies rated

These consider the movies that the user rated, and the genders of the people who rated these movies.

v1: let fc = the sum, over all movies that u rated, of the number of females who rated that movie; let mc = the corresponding sum of the number of males. Return fc/(fc+mc).

v2: determine the gender average for each movie that u rated, and average these.

http://cs.ubc.ca/~poole/cs532/2017/as1/predictors.py
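The v1 and v2 aggregations above can be sketched as follows (`rated` and `gender` are hypothetical toy stand-ins for the assignment data):

```python
# Toy data (hypothetical): who rated each movie, and the known genders.
rated = {"m1": ["u1", "u2", "u3"], "m2": ["u1", "u3"]}
gender = {"u1": "F", "u2": "M", "u3": "M"}

def v1(user_movies):
    # v1: pool female/male rater counts over all movies the user rated.
    fc = sum(1 for m in user_movies for u in rated[m] if gender.get(u) == "F")
    mc = sum(1 for m in user_movies for u in rated[m] if gender.get(u) == "M")
    return fc / (fc + mc)

def v2(user_movies):
    # v2: per-movie female proportion, then average over the movies.
    props = []
    for m in user_movies:
        gs = [gender[u] for u in rated[m] if u in gender]
        props.append(sum(g == "F" for g in gs) / len(gs))
    return sum(props) / len(props)
```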


Average of movies with rating v4

This is like v2, but only considers people who gave the same rating to a movie as the user did. See: http://www.cs.ubc.ca/~poole/cs532/2017/as1/Gender_from_ratings.html which can be generated from http://www.cs.ubc.ca/~poole/cs532/2017/as1/Gender_from_ratings.ipynb

KNN

K-Nearest Neighbour.

The first option refers to the metric used, the second to the value of k, and the third to the input feature.

"ratedfeature" means 1 if and only if a rating exists; "ratingfeature" means the true rating.

It doesn't perform very well. Inspecting the predictions, this is probably because there are so many males that few female predictions are made.
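A minimal cosine-similarity KNN sketch matching the "ratingfeature" setup above; the rating vectors and the helper `knn_predict` (returning the female fraction among the k nearest neighbours) are hypothetical:

```python
import math

def cosine(a, b):
    # Cosine similarity between two rating vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, train, k=3):
    # train: list of (rating_vector, gender) pairs; return P(female)
    # as the fraction of females among the k most similar neighbours.
    nbrs = sorted(train, key=lambda t: cosine(query, t[0]), reverse=True)[:k]
    return sum(g == "F" for _, g in nbrs) / k

# Toy training set (hypothetical).
train = [([5, 0, 0], "F"), ([4, 1, 0], "F"),
         ([0, 5, 5], "M"), ([0, 4, 5], "M"), ([1, 0, 5], "M")]
```

With a male-heavy training set, most neighbourhoods are male-dominated, which matches the observation above about few female predictions.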

Deep Relational Learning v1

Let m represent the set of all movies, u represent the set of all users, and for some U ∈ u and M ∈ m, let R(U, M) represent the rating of U for M. Also let F_1(M), …, F_k(M) represent k hidden features for each movie, such that F_i(M) represents the i-th hidden feature of M. All hidden features are initially assigned random values.


Now consider a neural network whose inputs are the rating matrix as well as the hidden-feature vectors. The first hidden layer of this neural network contains k+1 neurons, where each neuron is a vector defined as follows:

where b is the bias and w is the weight. These vectors go through an activation layer, which can be a sigmoid layer. The next hidden layer is a normal neural network layer with t neurons, where the value of the j-th neuron is computed as follows:

These vectors again go through an activation layer. There can be multiple hidden layers of this type. The last layer must contain only one neuron, whose output corresponds to the probability of the user being, say, male. We can train this neural network and learn the biases, weights, and hidden features using back-propagation. I regularize the final predictions of my model towards the mean and control the amount of regularization using a hyper-parameter.

For the results I have reported, I have used two hidden features for each movie (k=2) and my network has one hidden layer. Given that the results I obtain may differ in each run, I ran my code 5 times and reported the average.

Deep Relational Learning v2

There are several movies in our dataset with only 1 or 2 ratings, and this causes my previous model to overfit: if M is rated by U and U is a male, then M gets a very high weight towards maleness. This is problematic if a female in the training set has also rated M. This is obviously a drawback of my model which I didn't know about before doing this experiment.

To solve this, I ignored all movies having <= 10 ratings. This time, I ran my code 10 times and reported the best result (I'm allowed to do so because our current test set is actually a validation set).

Naive Bayes with uniform prior

Naive Bayes with an uninformed uniform prior for each movie's rating given gender. This uses the Laplace prior.

Naive Bayes with hierarchical prior

Naive Bayes with informed hierarchical prior (using the rating distribution for each gender in the population as the prior for each movie). The prior was a pseudo-count of 20 times the population distribution.

Naive Bayes with limited children

Naive Bayes, but limited to 6 children, and with a pseudo-count (L2 regularizer) of 10. If there are more than 6 children, average over all assignments to six children. (6 was the minimum found on test data, so this needs to be validated on a different test set.)

The last 3 are available from http://cs.ubc.ca/~poole/cs532/2017/as1/Gender_from_ratings_with_nb.html (or the code is at http://cs.ubc.ca/~poole/cs532/2017/as1/Gender_from_ratings_with_nb.ipynb )
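The pseudo-count smoothing common to these naive Bayes variants can be sketched as follows (hypothetical toy data; the hierarchical prior and limited-children averaging are omitted):

```python
import math
from collections import defaultdict

# Toy training data (hypothetical): gender plus a dict of movie -> rating.
train = [("F", {"m1": 5, "m2": 4}),
         ("M", {"m1": 2}),
         ("M", {"m1": 2, "m2": 1})]
RATINGS = [1, 2, 3, 4, 5]

# counts[g][m][r]: number of gender-g users who gave movie m rating r.
counts = {"F": defaultdict(lambda: defaultdict(int)),
          "M": defaultdict(lambda: defaultdict(int))}
n_gender = {"F": 0, "M": 0}
for g, rs in train:
    n_gender[g] += 1
    for m, r in rs.items():
        counts[g][m][r] += 1

def prob_female(user_ratings, pseudo=1.0):
    # P(g) * prod_m P(rating_m | g), with a pseudo-count on each rating value.
    def loglik(g):
        ll = math.log(n_gender[g] / len(train))
        for m, r in user_ratings.items():
            num = counts[g][m][r] + pseudo
            den = sum(counts[g][m].values()) + pseudo * len(RATINGS)
            ll += math.log(num / den)
        return ll
    return 1 / (1 + math.exp(loglik("M") - loglik("F")))
```

Raising `pseudo` pulls each movie's rating distribution toward uniform, which is the knob the uniform and hierarchical priors above are tuning.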


L2LogisticRegwithTopKRatings (Moumita)

Considered only the top 60 ratings per user and solved using logistic regression with an L2 regularizer. Ratings > 3 were also mapped to 1, and 0 otherwise: the model treats a rating > 3 as a 1, and a rating <= 3 as equivalent to not rated. Rather than predicting the exact values, I predict the probability.
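A sketch of L2-regularized logistic regression on binarized ratings, trained by plain gradient descent (toy data; the real method additionally keeps only each user's top 60 ratings):

```python
import math

# Toy binarized features (1 iff rating > 3) and labels (1 = female);
# all values are hypothetical.
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 0, 0]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit(X, y, lam=0.1, lr=0.5, steps=500):
    # Batch gradient descent on L2-regularized logistic loss.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        gw, gb = [lam * wi for wi in w], 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for j, xj in enumerate(xi):
                gw[j] += (p - yi) * xj
            gb += p - yi
        w = [wi - lr * gi / len(X) for wi, gi in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

w, b = fit(X, y)

def predict(x):
    # Probability of the positive class (female) rather than a hard label.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```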

Neural network

Because of having 1512 input features, a neural network failed to learn a model for this data-set. So I first applied PCA (principal component analysis) to the data-set to reduce the dimension (number of features), and then built a neural network on the new data-set.
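A minimal PCA sketch via eigen-decomposition of the covariance matrix; the random matrix here is a hypothetical stand-in for the 1512 rating features:

```python
import numpy as np

# Hypothetical stand-in data: 20 users, 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))

def pca_project(X, k):
    Xc = X - X.mean(axis=0)                    # centre the features
    cov = Xc.T @ Xc / (len(X) - 1)             # sample covariance
    vals, vecs = np.linalg.eigh(cov)           # ascending eigenvalues
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k principal directions
    return Xc @ top

Z = pca_project(X, 2)  # reduced data-set to feed the downstream model
```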


Matrix Factorization and Support Vector Machine

Matrix factorization is used to generate a dimension-reduced feature vector for each user, which is then fed as input to the SVR.

Matrix Factorization and Logistic regression

Using matrix factorization, I built new train and test data-sets. These data-sets contain user ids and the set of features learned for each user via matrix factorization. I built a logistic regression model over the new data-sets. I tested the overall model (matrix factorization plus logistic regression) for different k (the number of matrix-factorization features learned for each user and each movie). The chart below shows log loss on the test data vs. k. As the chart shows, we achieve the best log loss when k = 5. The numbers reported in the results table are those for k = 5.

This chart shows log loss for different values of k.
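A minimal matrix-factorization sketch (SGD on observed entries only); the learned per-user vectors U[u] are the features that would be fed to logistic regression. Data, dimensions, and learning-rate values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k = 4, 5, 2
# Observed (user, movie, rating) triples -- hypothetical toy data.
obs = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 1.0), (1, 2, 2.0),
       (2, 1, 5.0), (3, 2, 1.0), (3, 4, 2.0)]

# Small random init for user and movie factor matrices.
U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_movies, k))

lr, lam = 0.02, 0.01
for _ in range(1000):
    for u, m, r in obs:
        err = r - U[u] @ V[m]          # residual on this observed entry
        U[u] += lr * (err * V[m] - lam * U[u])
        V[m] += lr * (err * U[u] - lam * V[m])

# U[u] is now a k-dimensional feature vector for user u, ready to be
# passed to the downstream classifier.
```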

RDN-Boost

RDN-Boost learns regression trees with a combination of thresholding functions. I tested RDN-Boost on the three datasets. The test results show that RDN-Boost performs no better than the training average, which we take as evidence that the underlying aggregations are not extracting useful information.

BPMF and Logistic Regression

Same as the method above, except that instead of matrix factorization I used Bayesian Probabilistic Matrix Factorization (BPMF), a graphical-model variant of matrix factorization.

An introduction to this method can be read here: https://www.cs.toronto.edu/~amnih/papers/bpmf.pdf

The performance of this method was, however, very unstable. The score above came from a single seed; other seeds can give answers that are a lot worse.

The instability has two sources: first, the random initialization of the parameters; second, the MCMC sampling required by BPMF.

I am still researching ways to understand this instability and whether anything can be done to fix it.

Semi Supervised BPMF

I also use the Bayesian Probabilistic Matrix Factorization model for this method. But instead of training a linear model on the resulting per-user feature projections, I fix one user feature to one value if the user is female and to another value otherwise, and use it as a semi-supervised label throughout the MCMC chain.

Compared to the previous method, it benefits from the fact that BPMF is aware of gender right from the beginning of training, but it suffers from a less expressive representation, which is a likely reason why it didn't perform very well.

ProbLog

Two models were tried: naive Bayes, and gender depending on popular and unpopular ratings.

Logistic Regression with limited number of parents and early stopping

I applied logistic regression, limiting the number of parents to K, and tested different values of K. While training, I calculate the log-loss on a validation set and select the parameters with the best performance on this set (i.e. early stopping). Below are the results I got for different values of K:

K    log-loss   sum-squared-error
1    148.72     35.36
3    144.63     34.18
6    139.04     32.30
8    139.00     31.94
10   136.50     31.42
12   138.41     32.01
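The early-stopping rule can be sketched as follows, with a hypothetical U-shaped validation-loss curve standing in for the real per-epoch log-loss:

```python
# Hypothetical validation-loss curve: improves, then overfits.
def val_loss(step):
    return (step - 6) ** 2 + 136.5

# Early stopping: remember the step (i.e. the parameters) with the
# best validation loss seen so far.
best_step, best_loss = None, float("inf")
for step in range(12):
    loss = val_loss(step)
    if loss < best_loss:
        best_step, best_loss = step, loss
```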