
Sentiment Classifier


Dan, Michael, Louise

What is the problem?

How suitable is Haskell for machine learning? We want to use Haskell to train a model to classify user input into different categories.

What is the something extra?

We trained a Naive Bayes classifier and used the resulting model to predict the label for a given phrase.

We currently have two implementations of our classifier. The first takes a phrase and tries to tell you which genre of movie it would best fit. This model is trained on the Cornell Movie-Dialogs Corpus, which contains thousands of lines of dialogue from hundreds of movies spanning 24 genres.

Our second implementation simply tries to predict whether a phrase is positive or negative. We train this model on a dataset of thousands of tweets that have been labelled as positive or negative.
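The approach described above can be sketched in a few lines of Haskell. This is a minimal illustration, not our project code: it assumes a bag-of-words multinomial Naive Bayes with add-one (Laplace) smoothing, and the names `Model`, `tokenize`, `train`, `classify` and the tiny `examples` list are all made up for the example.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (foldl', maximumBy)
import qualified Data.Map.Strict as M
import Data.Ord (comparing)
import qualified Data.Set as S

type Label = String

-- | Counts gathered from the training data (hypothetical layout).
-- The fields are strict so that training does not pile up thunks.
data Model = Model
  { docCounts  :: !(M.Map Label Int)                 -- phrases seen per label
  , wordCounts :: !(M.Map Label (M.Map String Int))  -- word frequencies per label
  , vocabulary :: !(S.Set String)                    -- every word seen in training
  }

-- | Lowercase the phrase and split it on anything that is not a letter.
tokenize :: String -> [String]
tokenize = words . map (\c -> if isAlpha c then toLower c else ' ')

-- | Fold the labelled phrases into a model, one example at a time.
train :: [(Label, String)] -> Model
train = foldl' addExample (Model M.empty M.empty S.empty)
  where
    addExample (Model dc wc vocab) (label, phrase) =
      let ws     = tokenize phrase
          counts = M.fromListWith (+) [(w, 1) | w <- ws]
      in Model (M.insertWith (+) label 1 dc)
               (M.insertWith (M.unionWith (+)) label counts wc)
               (S.union vocab (S.fromList ws))

-- | Pick the label with the highest log-probability, using add-one smoothing.
-- Returns Nothing if the model has seen no training data.
classify :: Model -> String -> Maybe Label
classify (Model dc wc vocab) phrase
  | M.null dc = Nothing
  | otherwise = Just (fst (maximumBy (comparing snd) scores))
  where
    ws        = tokenize phrase
    totalDocs = fromIntegral (sum (M.elems dc)) :: Double
    vocabSize = fromIntegral (S.size vocab) :: Double
    scores    = [(label, score label n) | (label, n) <- M.toList dc]
    score label n =
      let counts = M.findWithDefault M.empty label wc
          total  = fromIntegral (sum (M.elems counts)) :: Double
          prior  = log (fromIntegral n / totalDocs)
          likelihood w =
            log ((fromIntegral (M.findWithDefault 0 w counts) + 1)
                 / (total + vocabSize))
      in prior + sum (map likelihood ws)

-- | Made-up training data, standing in for the labelled tweets.
examples :: [(Label, String)]
examples =
  [ ("positive", "I love this movie")
  , ("positive", "what a great day")
  , ("negative", "I hate this weather")
  , ("negative", "what a terrible movie")
  ]

main :: IO ()
main = do
  let model = train examples
  print (classify model "great movie, I love it")  -- Just "positive"
  print (classify model "terrible weather")        -- Just "negative"
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that was never seen under one label from forcing that label's probability to zero.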

What did we learn from doing this?

We were originally going to train a chatbot with recurrent neural networks using TensorFlow, but found that Haskell does not support the framework within TensorFlow that we planned to use. Without this framework, implementing the recurrent neural networks was infeasible given our timeline, so we changed our goal.

With our new goal we tried to fit a model to predict the genre of a movie given a phrase from that movie. The main issue was that there were so many different genres that it was difficult to get an accurate prediction: most of the time the predicted genre was simply whichever genre was most common in the training data. Because of this we decided to switch to a different dataset, tweets labelled as positive or negative.
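One way to spot this kind of problem early is to look at how the labels are distributed and at how well a "always guess the most common label" baseline would do. The sketch below is an illustration only; `labelCounts`, `majorityBaseline` and the sample genre list are hypothetical and not taken from our code or data.

```haskell
import Data.List (maximumBy)
import qualified Data.Map.Strict as M
import Data.Ord (comparing)

-- | How many training examples carry each label.
labelCounts :: Ord a => [a] -> M.Map a Int
labelCounts labels = M.fromListWith (+) [(l, 1) | l <- labels]

-- | The most common label and the accuracy obtained by always predicting it.
majorityBaseline :: Ord a => [a] -> Maybe (a, Double)
majorityBaseline [] = Nothing
majorityBaseline labels =
  Just (label, fromIntegral n / fromIntegral (length labels))
  where
    (label, n) = maximumBy (comparing snd) (M.toList (labelCounts labels))

main :: IO ()
main = do
  -- Made-up labels; the real corpus has many more genres and examples.
  let genres = ["drama", "drama", "comedy", "drama", "thriller"]
  print (M.toList (labelCounts genres))  -- [("comedy",1),("drama",3),("thriller",1)]
  print (majorityBaseline genres)        -- Just ("drama",0.6)
```

If a trained classifier scores close to this baseline, it is probably leaning on the class prior rather than on the words in the phrase.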

We found Haskell well suited to cleaning the data: we were able to get the data into the format we wanted with very concise code.
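As a rough illustration of how compact this kind of cleaning can be, here is a sketch of a tweet normaliser built from ordinary list functions. The function name `cleanTweet` and the specific rules (drop @mentions and links, keep letters only, lowercase) are assumptions made for the example, not a description of our actual pipeline.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (isPrefixOf)

-- | Turn a raw tweet into a list of lowercase words.
-- Tokens starting with '@' or "http" are dropped, and all
-- non-letter characters are stripped from the remaining tokens.
cleanTweet :: String -> [String]
cleanTweet = filter (not . null) . map normalize . filter keep . words
  where
    keep w    = not ("@" `isPrefixOf` w || "http" `isPrefixOf` w)
    normalize = map toLower . filter isAlpha

main :: IO ()
main = print (cleanTweet "@bob I LOVED it!!! http://t.co/xyz :)")
-- ["i","loved","it"]
```

The whole cleaner is a single composition of small functions, so each rule can be changed or removed without touching the others.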

We also ran into a memory issue during production and discovered that it was caused by Haskell's lazy evaluation, in particular when using foldl and foldr. While there are ways around this issue, it has to be kept in mind when using Haskell on large datasets.
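The usual workaround is to replace the lazy fold with the strict left fold `foldl'` from `Data.List`, which evaluates the accumulator at every step instead of building a long chain of unevaluated thunks. The sketch below shows the idea on a word-count fold; `countLazy` and `countStrict` are hypothetical names, and whether the lazy version actually exhausts memory depends on the input size and on the compiler's optimisation settings.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- | Lazy left fold: the accumulator may be left unevaluated until the end,
-- so a long input can build a thunk per element before any work is done.
countLazy :: [String] -> M.Map String Int
countLazy = foldl (\m w -> M.insertWith (+) w 1 m) M.empty

-- | Strict left fold: the map is forced after every element,
-- so memory use tracks the number of distinct words, not the input length.
countStrict :: [String] -> M.Map String Int
countStrict = foldl' (\m w -> M.insertWith (+) w 1 m) M.empty

main :: IO ()
main = do
  let ws = ["a", "b", "a"]
  print (M.toList (countStrict ws))          -- [("a",2),("b",1)]
  print (countLazy ws == countStrict ws)     -- True
```

Both functions return the same result; they differ only in when the work happens. `foldr` has a related problem with strict combining functions, because it has to walk to the end of the list before it can start combining, so its stack use grows with the length of the input.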

Links to code etc