Course:CPSC312-2024/PartOfSpeechClassifier

Authors: Leo Gao, Kosmo Ng

What is the problem?

We want to try our hand at building a simple part-of-speech classifier. The machine learning model we use is a Hidden Markov Model (HMM). We think Haskell could be a good fit because the HMM needs a large table or dictionary, and Haskell's lazy evaluation could make this more efficient. To simplify the task, we limit predictions to vocabulary previously seen in the corpus. We also limit the length of the input to be predicted to 1-2 words, since the runtime is not yet optimized.
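
To make the idea of tagging with an HMM concrete, here is a minimal sketch of Viterbi decoding over a toy model. It is written in Python for brevity and is not the code from our repository. The function name `viterbi`, the tag set, and all probabilities below are made up for illustration. Words that a tag has never emitted get probability zero, which mirrors the restriction to previously seen vocabulary.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under a toy HMM."""

    def log(p):
        # Work in log space; a zero probability becomes negative infinity.
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: best log-probability of any tag path ending in tag t at word i
    best = [{t: log(start_p.get(t, 0)) + log(emit_p[t].get(words[0], 0))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Pick the previous tag that maximizes the path score into t.
            prev, score = max(
                ((p, best[i - 1][p] + log(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log(emit_p[t].get(words[i], 0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Hypothetical toy model, not taken from any real corpus.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"NOUN": 0.9, "VERB": 0.05, "DET": 0.05},
    "NOUN": {"VERB": 0.7, "NOUN": 0.2, "DET": 0.1},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.6, "walk": 0.4},
    "VERB": {"walk": 0.7, "barks": 0.3},
}

print(viterbi(["the", "dog"], TAGS, START, TRANS, EMIT))  # ['DET', 'NOUN']
```

The tables are plain nested dictionaries, which is one simple way to hold the "large table or dictionary" an HMM needs.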

What is the something extra?

We added functions to read and write the model so that other languages can also be used to parse or create it. We did this after realizing that building the model in Haskell was not optimal, and we used the read function to load a model built with Python.
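
As an illustration of exchanging a model between languages through a file, the sketch below writes a toy model to disk and reads it back in Python. The JSON layout, the field names, and the helper names `write_model` and `read_model` are assumptions made for this example. They are not the serialization format our Haskell code actually uses.

```python
import json
import os
import tempfile


def write_model(path, model):
    """Write the model tables to `path` as JSON (assumed format)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f, ensure_ascii=False, sort_keys=True)


def read_model(path):
    """Read a model previously written by `write_model`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical toy model; the keys are illustrative only.
model = {
    "tags": ["DET", "NOUN"],
    "start": {"DET": 0.6, "NOUN": 0.4},
    "transition": {"DET": {"NOUN": 1.0}, "NOUN": {"DET": 0.5, "NOUN": 0.5}},
    "emission": {"DET": {"the": 1.0}, "NOUN": {"dog": 1.0}},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    write_model(path, model)
    restored = read_model(path)

print(restored == model)  # True
```

Any text format that both sides can parse would serve the same purpose. JSON is used here only because most languages can read it.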

What did we learn from doing this?

This is our first time implementing a machine learning model from scratch and also our first time working with such large-scale data. The main thing we found while doing the project was the lack of support for machine learning on Hackage. Our original plan of using Data.HMM for the backend was not possible since it is no longer compatible. Other libraries that were unfortunately deprecated include Learning.HMM. Hasktorch is a current project that aims to bring PyTorch to Haskell. However, it is experimental and not recommended for use. We also learned, unfortunately, that it is hard to parse a large amount of data quickly in Haskell without using a library. Our datatype for the Matrix did not use an optimized library. As a result, training and predicting on a larger dataset can take up to a day. In contrast, building the model in Python with the same dataset takes about a minute.
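
For reference, training an HMM tagger from a tagged corpus mostly comes down to counting and normalizing. The sketch below shows that step in Python under simple assumptions: no smoothing, words lowercased, and a tiny made-up corpus. The function name `train` and the data are hypothetical. This is not the Python training script from our repository.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Estimate start, transition and emission probabilities by counting."""
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            if prev is None:
                start[tag] += 1        # tag of the first word in a sentence
            else:
                trans[prev][tag] += 1  # previous tag followed by this tag
            emit[tag][word.lower()] += 1
            prev = tag

    def normalize(counter):
        # Turn raw counts into relative frequencies.
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (
        normalize(start),
        {tag: normalize(c) for tag, c in trans.items()},
        {tag: normalize(c) for tag, c in emit.items()},
    )


# Hypothetical two-sentence corpus of (word, tag) pairs.
corpus = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
start_p, trans_p, emit_p = train(corpus)
print(start_p)         # {'DET': 1.0}
print(trans_p["DET"])  # {'NOUN': 1.0}
print(emit_p["NOUN"])  # {'dog': 0.5, 'cat': 0.5}
```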

Work division

Leo defined the types and the initialization of the matrix and the model. He also wrote the prediction for the model. In addition, Leo wrote a Python function for training a model that can then be read into the Haskell program.

Kosmo wrote the training (probability calculations) for the model. He also wrote the read and write functions and the serialization of the model.

Links to code etc.

https://github.com/kosmong/SyntacticTagger.git