CPSC 532 Bahare's Results

Method                          | ml-60k ASE | ml-60k Log Loss | ml-1m ASE | ml-1m Log Loss | Yelp ASE | Yelp Log Loss
Predict 0.5                     | 0.25       | 1               | ?         | ?              | ?        | ?
Neural network with PCA         | 0.3252     | 1.6016          | ?         | ?              | ?        | ?
L2-logistic regression with PCA | 0.2404     | 1.0370          | ?         | ?              | ?        | ?
MF and LR (k = 5) / Rating      | 0.2016     | 0.8437          | ?         | ?              | ?        | ?
RDN-Boost                       | 0.2154     | 0.8988          | 0.204     | 0.864          | 0.234    | 0.953
MF and LR (k = 5) / Rating > 3  | 0.1998     | 0.8440          | 0.1929    | 0.8240         | 0.2358   | 0.9586
MF and LR (k = 5) / Rated       | 0.2155     | 0.8995          | 0.2055    | 0.8683         | 0.2361   | 0.9595

Neural network

A logistic regression model cannot capture interactions between its features, but movies are not independent: for example, The Godfather and The Godfather Part II are strongly dependent, while a logistic regression model treats them as independent features. A neural network does not need this independence assumption and, in our case, can exploit dependencies between movies to predict the gender of the user. So I started with a neural network model. I translated the MovieLens dataset into a new dataset in which each movie is a feature. With 1512 input features, the neural network failed to learn a model for this dataset, so I first applied PCA (principal component analysis) to reduce the dimension (number of features) and then built a neural network on the reduced dataset.
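A minimal sketch of that pipeline, assuming a 0/1 user-by-movie "rated" matrix (one column per movie) and per-user gender labels; the random placeholder data, component count, and network size below are illustrative, not the actual course data or tuned settings:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Placeholder data: a 0/1 user-by-movie "rated" matrix and gender labels.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(943, 1512)).astype(float)
    y = rng.integers(0, 2, size=943)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Project the 1512 movie features onto a manageable number of components,
    # then fit a small network on the projected data. The component count and
    # network size are illustrative, not the tuned values.
    pca = PCA(n_components=50).fit(X_train)
    clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
    clf.fit(pca.transform(X_train), y_train)
    print("test accuracy:", clf.score(pca.transform(X_test), y_test))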

Principal Component Analysis

In search (information retrieval), people use PCA to find correlations between words. PCA reduces the dimension of the attributes, so in the search setting we would like words about the same topic, such as family, love, father, mother, ..., to be mapped to the same dimension. In our case, we can use PCA to extract correlations between movies: we would like movies that are dependent or share information to map to the same dimension.

L2-logistic regression with PCA

To get a sense of how PCA works on this dataset, I tested an L2-regularized logistic regression model on the data after PCA, and saw that PCA does not work well here. The reason is that PCA is unsupervised: it uses the dataset without the labels, whereas the logistic regression model makes use of the labels of the examples.
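The same check as a sketch, swapping the network above for an L2-regularized logistic regression; the regularization strength is a placeholder (in practice it would be chosen by cross-validation), and L2 is scikit-learn's default penalty:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Same placeholder data shape as the neural network sketch above.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(943, 1512)).astype(float)
    y = rng.integers(0, 2, size=943)

    # PCA runs without seeing y; only the logistic regression uses the labels.
    model = make_pipeline(PCA(n_components=50),
                          LogisticRegression(C=1.0, max_iter=1000))
    model.fit(X, y)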

RDN-Boost

RDN-Boost learns regression trees. We use RDN-Boost as a test of the standard explicit aggregators: if the standard aggregators are useful, then RDN-Boost should perform well. Our test results show that RDN-Boost does not perform well, which we take as evidence that the underlying aggregations are not extracting useful information. I tested RDN-Boost on all three datasets, using the code and tutorial at http://ftp.cs.wisc.edu/machine-learning/shavlik-group/WILL/rdnboost/

Considering time-stamps of ratings

I suspected there is a pattern in how people rate movies. My hypothesis was that people rate all the items (movies) they love when they join these sites; one reason for rating all the items they love is to get good recommendations. I tested this hypothesis by counting the number of movies each user rated on each date, and found no pattern.
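A sketch of that check, assuming a MovieLens-style tab-separated ratings file; the file name and column names below are my own placeholders:

    import pandas as pd

    # MovieLens-style tab-separated ratings: user, item, rating, Unix timestamp.
    ratings = pd.read_csv("u.data", sep="\t",
                          names=["user_id", "item_id", "rating", "timestamp"])
    ratings["date"] = pd.to_datetime(ratings["timestamp"], unit="s").dt.date

    # Movies rated per user per calendar day; a join-day rating burst would
    # show up as one very large per-user maximum.
    per_day = ratings.groupby(["user_id", "date"]).size()
    print(per_day.groupby(level="user_id").max().describe())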

More informative movies

One problem with this dataset is that there are many movies with few ratings and some movies with very many ratings. One hypothesis for dealing with this is to find the more informative movies and ignore the useless ones. My hypothesis was that the movies with many ratings are the ones everybody loves, so they are less useful than the movies with few ratings. Restricting to the movies with few ratings did not help at all in simple models (like LR); a sketch of the filtering step follows. See https://gist.github.com/baharefatemi/e4338625f00e4f623d236dbe0782b81e
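A sketch of the filtering step, using the same placeholder ratings file as above; the 20-rating cutoff is an arbitrary illustration, not a value from the experiments:

    import pandas as pd

    ratings = pd.read_csv("u.data", sep="\t",
                          names=["user_id", "item_id", "rating", "timestamp"])

    # Keep only rarely rated movies, on the theory that blockbusters everybody
    # rates (and loves) carry little signal about the user.
    counts = ratings["item_id"].value_counts()
    rare_items = counts[counts < 20].index          # cutoff is illustrative
    filtered = ratings[ratings["item_id"].isin(rare_items)]
    print(len(ratings), "->", len(filtered), "ratings after filtering")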

Matrix Factorization and Logistic regression

Using matrix factorization, I built new train and test datasets containing the user ids and the set of features learned for each user via matrix factorization, and then built a logistic regression model over these new datasets. I tested the overall model (matrix factorization plus logistic regression) for different values of k (the number of matrix factorization features learned for each user and each movie) when learning the rating of a user for each movie. The chart below shows the log loss on the test data against k; as the chart shows, the best log loss is achieved at k = 5, and the numbers reported in the results table use k = 5.

[Chart: log loss on the test data for different values of k.]
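A compact sketch of the two stages, using scikit-learn's NMF as a stand-in for whatever factorization the linked gist actually implements, with k = 5 as in the table; the ratings matrix and labels are random placeholders:

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.linear_model import LogisticRegression

    # Placeholder ratings matrix (0 = unrated) and gender labels. Note that
    # NMF treats the zeros as observed values; a proper MF for recommendation
    # would mask the unrated entries.
    rng = np.random.default_rng(0)
    R = rng.integers(0, 6, size=(943, 1512)).astype(float)
    y = rng.integers(0, 2, size=943)

    # Stage 1: factorize R ~ U @ V, giving k latent features per user (rows of U).
    k = 5
    U = NMF(n_components=k, init="nndsvda", max_iter=500).fit_transform(R)

    # Stage 2: logistic regression over the learned user features.
    lr = LogisticRegression(max_iter=1000).fit(U, y)
    print(lr.predict_proba(U)[:5, 1])   # P(label = 1) for the first five users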

I also trained this model on the Rated and Rating > 3 datasets. Because these datasets lose some information by ignoring the actual rating a user gives a movie, I did not need to test as many values of k. I chose the parameters by cross-validation and report the results of the different models.

Here is the link to the code: https://gist.github.com/baharefatemi/fc4855242ef587a4582c3b9d6e882d6a

I also used the Weka implementations of the logistic regression, neural network, and PCA models, and optimized their parameters by cross-validation.
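For illustration, the same kind of tuning loop in scikit-learn terms (Weka's own cross-validation interface differs; the data and parameter grid below are made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Placeholder data; the grid of regularization strengths is illustrative.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))
    y = rng.integers(0, 2, size=200)

    # 5-fold cross-validation, scored by log loss as in the results table.
    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          {"C": [0.01, 0.1, 1.0, 10.0]},
                          cv=5, scoring="neg_log_loss")
    search.fit(X, y)
    print(search.best_params_)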

Future Work

1. Testing the neural network on our dataset on a more powerful machine.

2. Using PCA or another dimensionality reduction algorithm as a supervised method.