From UBC Wiki

Bag of Words Spam Detection

Authors: Andrew Bolyachevets, David Karchynski and Trueman Lam

What is the problem?

We implement a variation of the Bag of Words spam detection algorithm. This involves sourcing, sanitizing, and parsing training data to construct a corpus of unique elements. The corpus is then used to vectorize new, unclassified sentences and label them as "spam" or not accordingly.
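The corpus-and-vectorize step can be sketched as follows. This is a minimal illustration using Data.Map rather than the sparse vectors the project actually uses; the function names (buildCorpus, vectorize) are hypothetical, not taken from the project code.

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)

-- Build a corpus of unique tokens from the tokenized training
-- documents, mapping each token to its index in the feature vector.
buildCorpus :: [[String]] -> Map String Int
buildCorpus docs = Map.fromList (zip uniqueToks [0 ..])
  where uniqueToks = Map.keys (Map.fromList [(t, ()) | doc <- docs, t <- doc])

-- Vectorize a new tokenized sentence against the corpus: each
-- position holds the count of the corresponding corpus token.
-- Tokens outside the corpus are ignored.
vectorize :: Map String Int -> [String] -> [Int]
vectorize corpus toks =
  [ Map.findWithDefault 0 ix occ | ix <- [0 .. Map.size corpus - 1] ]
  where occ = Map.fromListWith (+)
          [ (i, 1) | t <- toks, Just i <- [Map.lookup t corpus] ]
```

A classifier would then compare such count vectors for new sentences against vectors built from labeled spam and non-spam training documents.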

What is the something extra?

The program loop asks the user to select the value of n used to construct the n-gram corpus, a classification strategy (Naive Bayes or Cosine Similarity), and a file name. Intermediate data matrices are cached during the first iteration of the program, making classification faster on successive runs. Our implementation takes advantage of some external code: the sparse vector implementation from Data.Sparse.SpVector and the Porter Stemming Algorithm (see the code for references). Besides Porter stemming, we use term frequency-inverse document frequency (tf-idf) to eliminate uninformative stems (i.e., those present in most documents).
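The two building blocks mentioned above can be sketched in a few lines. This is an illustrative sketch, not the project's code: `ngrams` is a hypothetical helper (for n = 1 it reduces to plain bag of words), and `cosine` shows the cosine-similarity strategy on dense count vectors rather than the project's sparse ones.

```haskell
-- All contiguous n-grams of a token list, e.g.
-- ngrams 2 ["buy","now","free"] gives [["buy","now"],["now","free"]].
ngrams :: Int -> [a] -> [[a]]
ngrams n xs
  | length xs < n = []
  | otherwise     = take n xs : ngrams n (tail xs)

-- Cosine similarity between two term-count vectors: the cosine of
-- the angle between them, 1.0 for parallel, 0.0 for orthogonal.
cosine :: [Double] -> [Double] -> Double
cosine u v = dot / (norm u * norm v)
  where dot    = sum (zipWith (*) u v)
        norm w = sqrt (sum (map (^ 2) w))
```

With cosine similarity, an unclassified sentence is labeled according to whether its vector lies closer to the spam or the non-spam training vectors.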

What did we learn from doing this?

During the course of the project we encountered several difficulties inherent to Haskell's design (due to lazy evaluation and immutability). In particular, we had to alter our data structures and algorithms to tackle memory leaks when processing non-trivially large data files to train our model. While immutability enables significant compiler optimizations, it makes working with IO a challenge. In production environments, streams or stream-managing libraries such as pipes or conduit would likely need to be employed for data processing.
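A standard illustration of the kind of leak lazy evaluation causes (not the project's actual fix): a lazy left fold over a large input builds a chain of unevaluated thunks, whereas forcing the accumulator at each step keeps memory constant.

```haskell
import Data.List (foldl')

-- Lazy left fold: the accumulator is a growing chain of thunks
-- (0 + 1) + 2) + 3) + ..., only forced at the very end. On a large
-- training file this is a classic Haskell space leak.
leakyCount :: [Int] -> Int
leakyCount = foldl (+) 0

-- Strict left fold: foldl' evaluates the accumulator at every
-- step, so the fold runs in constant space.
strictCount :: [Int] -> Int
strictCount = foldl' (+) 0
```

The same principle applies to accumulating maps or matrices while parsing training data: using strict variants (e.g. Data.Map.Strict) avoids deferring all the work to the end.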

Links to code etc