Course:CPSC532:StaRAI:2017:XingZeng


All code is at https://bitbucket.org/Tamaki_Sakura/cpsc532p/src


CPSC 532 Xing's Results

| Method | ml-60k ASE | ml-60k Log loss | ml-1m ASE | ml-1m Log loss | Yelp ASE | Yelp Log loss |
|---|---|---|---|---|---|---|
| training average (pseudocount=50) | 0.2139 | 0.8932 | 0.2044 | 0.8642 | 0.2364 | 0.9605 |
| KNN/Cosine/5/ratingfeature (NS) | 0.2599 | 86.3277 | ? | ? | ? | ? |
| BPMF(5) with Logistic Regression/Ratings (NS) | 0.1950 | 0.8223 | ? | ? | ? | ? |
| BPMF(5) with Logistic Regression/Rating > 3 | 0.2046 | 0.8673 | 0.1896 | 0.8083 | 0.2363 | 0.9600 |
| BPMF(5) with Logistic Regression/Int Var/Rating > 3 | 0.2003 | 0.8497 | 0.1894 | 0.8077 | ? | ? |
| BPMF(5) with Logistic Regression/One Start With User Gender/Int Var/Rating > 3 | 0.1927 | 0.8235 | 0.1871 | 0.8015 | 0.2361 | 0.9594 |
| Naive Bayes/Normal Distribution/Rating > 3 | 0.2160 | 0.9006 | ? | ? | ? | ? |
| Markov Logic Network with Hidden/Rating > 3 | 0.2124 | 0.8882 | 0.2048 | 0.8652 | 0.2344 | 0.9538 |
| Markov Logic Network with 2 weights per item (CV) | 0.1812 | 0.7961 | 0.1353 | 0.6098 | 0.1940 | 0.8196 |
| Markov Logic Network with 2 weights per item (Cheated)(NS) | 0.1778 | 0.7764 | 0.1351 | 0.6081 | 0.1910 | 0.8093 |

The first row (training average) isn't mine; it is kept only for comparison.

All rows marked (NS) are non-standard results and should not appear in a publication.

My first Markov Logic Network with 2 weights per item is trained using sklearn with a fixed, almost-zero regularization and the number of iterations selected by cross validation. The three datasets take 7/12/30 iterations respectively.

The second one, marked (NS), is trained with "cheated" iteration counts; the three datasets take 4/6/24 iterations respectively.

Description

  • Simple models

The first thing I tried in approaching this problem was simple directed graphical models, because they are easy to implement and make a good starting point.

The first model I tried assumes each gender has a single mean rating and uses a normal distribution to compute the likelihood of the data. The per-gender means are learned by MLE, and the probability of a user's gender given their ratings (which is proportional to the probability of their ratings given their gender) is used as the output. This model doesn't work because it is far too simple.

Then I made some small changes to the model so that the mean is no longer shared across a whole gender; it is only shared among users with the same gender rating the same movie. The average percentage of female users is also used as a prior. This makes my model work essentially like a Naive Bayes model in which each of a user's ratings is treated as a feature. Training is done by MLE. It works better than the previous one but is still only about as good as predicting the average, and even then it requires tuning a variance hyperparameter to make sure it is not worse than predicting the average, so it is not entirely what I want. I think this is partly because a normal distribution shouldn't be used here: the other Naive Bayes model, which uses a categorical distribution, actually works well.
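The per-(gender, movie) Naive Bayes model above can be sketched as follows. The toy data, the 0-means-unrated encoding, the 1=female labelling, and the fallback to the global mean for unrated movies are all assumptions made for illustration; `sigma` plays the role of the variance hyperparameter the text says must be tuned.

```python
import numpy as np

# Hypothetical toy data: rows = users, columns = movies, 0 = unrated.
ratings = np.array([
    [5.0, 0.0],
    [4.0, 0.0],
    [1.0, 0.0],
    [2.0, 0.0],
])
genders = np.array([1, 1, 0, 0])  # 1 = female, 0 = male (hypothetical encoding)

def fit_means(ratings, genders):
    """MLE per-(gender, movie) rating means; movies a gender never
    rated fall back to the global mean (an assumption)."""
    global_mean = ratings[ratings > 0].mean()
    means = np.full((2, ratings.shape[1]), global_mean)
    for g in (0, 1):
        rows = ratings[genders == g]
        obs = rows > 0
        counts = obs.sum(axis=0)
        seen = counts > 0
        means[g, seen] = rows.sum(axis=0)[seen] / counts[seen]
    return means

def p_female(user, means, prior_female=0.5, sigma=1.0):
    """Normal likelihood per observed rating, combined with the
    gender prior via Bayes' rule."""
    obs = user > 0
    logp = np.array([np.log(1 - prior_female), np.log(prior_female)])
    for g in (0, 1):
        diff = user[obs] - means[g][obs]
        logp[g] += -0.5 * np.sum(diff ** 2) / sigma ** 2
    p = np.exp(logp - logp.max())
    return p[1] / p.sum()

means = fit_means(ratings, genders)
prob = p_female(np.array([5.0, 0.0]), means)
```

A high rating on a movie that females rated highly pushes the posterior toward female; shrinking `sigma` sharpens the likelihood, which is why the variance has to be tuned to avoid doing worse than the average predictor.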

Then I tried some more common models on this problem to see if they provide anything interesting, most notably K-Nearest-Neighbour. For a problem with dimensionality as high as the collaborative filtering problem we have here, a cosine distance metric is the more natural choice, as it is not affected by absolute magnitude. However, it also didn't work very well, even when I forced it to predict a probability based on the distances to the closest elements - probably because our problem is internally still not metric based.
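A minimal sketch of this KNN setup, using sklearn's `KNeighborsClassifier` with a cosine metric and distance-weighted votes to get probabilities rather than hard labels. The rating matrix and gender labels are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy user-by-movie rating matrix (rows = users); 0 = unrated (assumption).
X = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
    [5, 5, 0, 0],
    [0, 0, 5, 5],
])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = female, 0 = male (hypothetical labels)

# Cosine distance ignores absolute rating scale; distance-weighted votes
# turn the neighbours into a (crude) probability estimate.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine", weights="distance")
knn.fit(X, y)
proba = knn.predict_proba([[5, 4, 1, 0]])[0]
```
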

  • Matrix Factorization

Next, I started to research Matrix Factorization related models, because this type of model makes the most sense in this field. The first thing I tried is regular Matrix Factorization. I aggregate matrix factorization by treating the user feature vectors learned in training as a dimension-reduced representation, and training a simpler model on them. The first such model I trained was Support Vector Regression, but for some reason SVR did not perform very well, probably because SVR is mostly designed for larger problems. When I switched to a more lightweight model like Logistic Regression, it started to give acceptable performance.

  • Bayesian Probabilistic Matrix Factorization and its variants

Afterwards, I started to research Bayesian Probabilistic Matrix Factorization related models. I spent most of my time and focus on this method, as it has long been treated as a major method in collaborative filtering. I started by training Logistic Regression on the user features; its performance can be good and better than Matrix Factorization, but not always - it depends on the random initialization of the parameters. I realized that this is mostly caused by the instability of the BPMF method, as it is trained using MCMC. I tried changing the hyperparameters, and a higher hyperparameter seems to help a bit in reducing this randomness, but not by a lot.

Then I considered how to change the prediction variable in the BPMF part from the rating to rating > 3 - a logical variable. The ideal way to make it a logical variable is to apply the logistic function to the original observation variable, where the result of the logistic function is the probability of observing a positive. But this introduces a big problem: the normal prior we have in BPMF is no longer the conjugate prior of the new observation variable, which now follows a Bernoulli distribution with probability given by the logistic function. This means a complicated method (like rejection sampling) would be needed to train this model. To avoid this, I used an idea in the spirit of variational inference: I simply set the old prediction variable to a large positive value when I observe a positive, and a large negative value when I observe a negative.
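The conjugacy-preserving shortcut described above amounts to a tiny transformation; a sketch, where the clamp magnitude `big` and the function names are assumptions:

```python
import numpy as np

def to_pseudo_observation(positive, big=10.0):
    """Instead of a logistic (Bernoulli) likelihood, keep the Gaussian
    observation model and clamp the observed value to a large +/- number.
    `big` is a hypothetical magnitude."""
    return np.where(positive, big, -big)

def sigmoid(x):
    # At prediction time the predicted latent score is squashed back
    # into a probability of a positive observation.
    return 1.0 / (1.0 + np.exp(-x))

obs = np.array([True, False, True])   # rating > 3 or not
pseudo = to_pseudo_observation(obs)
prob_positive = sigmoid(2.0)          # e.g. a predicted latent score of 2.0
```

This keeps the normal prior conjugate to the (Gaussian) observation, so the ordinary Gibbs updates still apply; the logistic function is only used when reading probabilities back out.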

I also tried adding semi-supervision to BPMF. This is done by not updating one component of the user feature vector, instead fixing it to 1 for male and -1 for female at all times. This didn't work very well, however, since it only ties into one variable in the movie feature vector.
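The clamping step can be sketched as a small helper that would run after each Gibbs update of the user feature matrix; the function name, the matrix shapes, and the male=0/female=1 encoding are all assumptions.

```python
import numpy as np

def clamp_gender_component(U, genders, dim=0):
    """After each Gibbs update, overwrite one component of the user
    feature matrix with +1 (male) / -1 (female) instead of letting
    it be resampled."""
    U = U.copy()
    U[:, dim] = np.where(genders == 0, 1.0, -1.0)
    return U

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 5))            # 4 users, 5 latent features
genders = np.array([0, 1, 0, 1])
U_clamped = clamp_gender_component(U, genders)
```

Only the clamped column interacts with the matching component of each movie feature vector, which is consistent with the observation that the tweak has limited effect.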

However, I do find that seeding part of the user feature vector with the gender helps performance, even across different random initializations of the feature vector. The performance gain from seeding in BPMF is also greater than the gain from seeding in MF, which shows how important initial parameters are for BPMF.

  • Other Methods

After all of this, I started trying the methods on the new datasets, namely ml-1m and Yelp, and found they didn't work as well there. At first I didn't understand why, but now I think it is because the number of iterations was optimized for ml-60k and I did not re-optimize it. This also suggests that BPMF may indeed be slow to converge.

I also spent some time researching MLN with weights per item, after Zilun Peng showed that this method provides good performance.

I implemented it via its equivalent formulation as Logistic Regression and trained the model using sklearn's logistic regression class. I was able to show that the good result of MLN with weights per item is caused by early stopping of the gradient descent method, and this level of performance is not always observed with a different optimization method.
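The early-stopping effect can be sketched with sklearn by capping `max_iter` while keeping regularization near zero; the synthetic dataset and the iteration caps below are assumptions, chosen only to show the mechanism.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X @ rng.normal(size=20) + rng.normal(scale=2.0, size=300) > 0).astype(int)

def fit_stopped(max_iter):
    """Almost-unregularized logistic regression (large C); the real knob
    is the iteration cap, mimicking early-stopped gradient descent."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ConvergenceWarning)
        return LogisticRegression(C=1e6, solver="saga", max_iter=max_iter).fit(X, y)

early = fit_stopped(5)      # early-stopped, analogous to the CV-chosen iteration
full = fit_stopped(5000)    # run (close) to convergence
```

With an iterative first-order solver such as `saga`, stopping early acts as an implicit regularizer; a solver that converges fully to the (nearly) unregularized optimum would not show the same benefit, matching the observation above.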

Future Work

  • Future work related to MLN

I think we don't yet understand why a simple MLN with weights per item performs so well when not fully trained. It would be worth looking into what really happens in that method, and whether there is an automatic way (either pure regularization or optimization based) to reproduce this performance without choosing the number of iterations by cross validation.

  • Future work related to BPMF

Our method of training BPMF uses Gibbs sampling. There have been massive advances in optimization and sampling methods; some other sampling methods may be less sensitive to variable initialization and may be faster in terms of run time.

Besides, we haven't tested BPMF enough on the larger datasets. I think probabilistic methods are usually flexible enough to generalize well, so the poor result might be caused by a wrong choice of iteration count or number of hidden features, and it is worth investigating whether BPMF can still generalize there.