Random Forest

From UBC Wiki

What is the problem?

Random Forest is a machine learning algorithm used for classification and regression.

We plan to implement the random forest algorithm in Haskell.

Before running the algorithm, the program will:

  • Require the user to provide data (CSV or direct input)
  • Allow the user to specify the number of bootstrapped trees to aggregate (default = 50). Adding trees generally makes the model's predictions more stable, with diminishing returns and a longer training time.
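
The bootstrapping step mentioned above can be sketched as follows. This is only an illustration, not the project's code: every name here (defaultNumTrees, nextSeed, bootstrap, bootstrapSamples) is hypothetical, and the random numbers come from a small hand-written linear congruential generator so that the example needs nothing beyond the Prelude. It assumes a 64-bit Int.

```haskell
-- Hypothetical sketch of bootstrap sampling; names are not from the project.

-- | Default number of trees, as stated in the list above.
defaultNumTrees :: Int
defaultNumTrees = 50

-- | One step of a linear congruential generator (Numerical Recipes
--   constants). Adequate for a demo, not for serious statistical work.
nextSeed :: Int -> Int
nextSeed s = (1664525 * s + 1013904223) `mod` 4294967296

-- | Draw (length rows) rows with replacement.
--   Returns the sample together with the seed to use next.
bootstrap :: Int -> [a] -> ([a], Int)
bootstrap seed rows = go (length rows) seed []
  where
    n = length rows
    go :: Int -> Int -> [b] -> ([b], Int)
    go _ s _ | False = ([], s)
    go _ s acc' | null acc' && n == 0 = ([], s)
    go _ s acc' = finish s acc'
    finish s acc' = (map fst (pick (n - length acc') s) `seq` [], s) `seq` ([], s)
    pick :: Int -> Int -> [(Int, Int)]
    pick 0 _ = []
    pick k s = let s' = nextSeed s in (s', s') : pick (k - 1) s'
```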

After running the algorithm, the program will:

  • Be able to predict new values based on the trained model
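
As a rough illustration of how a trained forest might predict a new value for classification, the sketch below walks each tree and takes a majority vote. The Tree type and the functions predictTree, majority, and predictForest are assumptions made for this example; the project's actual representation may differ.

```haskell
import Data.List (group, sort, sortOn)
import Data.Ord (Down (..))

-- | Hypothetical decision tree: a leaf holds a label, a node splits on
--   one numeric feature (by index) at a threshold.
data Tree label
  = Leaf label
  | Node Int Double (Tree label) (Tree label)

-- | Follow the splits until a leaf is reached.
--   Assumes the feature vector is long enough for every index used.
predictTree :: Tree label -> [Double] -> label
predictTree (Leaf l) _ = l
predictTree (Node i threshold left right) xs
  | xs !! i <= threshold = predictTree left xs
  | otherwise            = predictTree right xs

-- | Most frequent element; ties go to the smallest label.
--   Nothing for an empty list.
majority :: Ord a => [a] -> Maybe a
majority xs =
  case sortOn (Down . length) (group (sort xs)) of
    ((x : _) : _) -> Just x
    _             -> Nothing

-- | Classify by majority vote over all trees in the forest.
predictForest :: Ord label => [Tree label] -> [Double] -> Maybe label
predictForest trees xs = majority [predictTree t xs | t <- trees]
```

For regression, the vote would be replaced by an average of the trees' numeric outputs.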

What is the something extra?

In addition to implementing the algorithm, we will:

  • Implement a CSV reader for ease of data input
  • Implement a predict function
  • Showcase this algorithm by using it to predict whether a person has contracted COVID-19 based on their existing symptoms.

What did we learn from doing this?

Haskell cannot read a CSV file with columns of different types (e.g. Int, Float) without help: we had to declare the column types in order to parse the file, and we found no parser module that reads a CSV file directly. Once the file has been read, however, we can parse its contents and convert them into a list that the random forest algorithm uses later. We also learned that files with a header and files without a header need different functions for reading and parsing.
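
To make the header and no-header distinction concrete, here is a minimal sketch that uses only the base library. It is not the project's reader: the names splitOnComma, parseRow, parseNoHeader, and parseWithHeader are hypothetical, every column is assumed to be numeric, and quoted fields are not handled.

```haskell
import Text.Read (readMaybe)

-- | Split one line on commas. Quoted fields are NOT handled.
splitOnComma :: String -> [String]
splitOnComma s = case break (== ',') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOnComma rest

-- | Parse every field of a row as a Double.
--   The declared type drives the parse; Nothing if any field fails.
parseRow :: String -> Maybe [Double]
parseRow = traverse readMaybe . splitOnComma

-- | Parse file contents that have no header line.
parseNoHeader :: String -> Maybe [[Double]]
parseNoHeader = traverse parseRow . filter (not . null) . lines

-- | Parse file contents whose first line holds the column names.
parseWithHeader :: String -> Maybe ([String], [[Double]])
parseWithHeader contents =
  case filter (not . null) (lines contents) of
    []              -> Nothing
    (header : rows) -> (,) (splitOnComma header) <$> traverse parseRow rows
```

Returning Maybe keeps a malformed field from crashing the program; a fuller reader would report which row and column failed.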

It turns out that it is very easy to declare new data types in Haskell and to write functions that work with them. Because of the functional approach, many functions are reusable and transferable to different data types with slight modifications. This allows us to write very concise code while using minimal high-level functions. We also discovered that random sampling is rather difficult in Haskell, and that its strong typing leads to some unforgiving moments when converting between data types.
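
A small sketch of both points, with names invented for this example (Sample, labels, mean): a new record type takes a few lines, a function over it works for any label type, and numeric conversions have to be written out explicitly.

```haskell
-- | Hypothetical labelled observation; the label type is a parameter,
--   so the same type serves classification (String) and regression (Double).
data Sample l = Sample
  { features :: [Double]
  , label    :: l
  } deriving (Show, Eq)

-- | Reusable across label types without modification.
labels :: [Sample l] -> [l]
labels = map label

-- | length returns an Int, and Haskell never converts Int to Double
--   implicitly, so fromIntegral is required here.
--   Returning 0 for an empty list is a choice made for this sketch.
mean :: [Double] -> Double
mean [] = 0
mean xs = sum xs / fromIntegral (length xs)
```

Leaving out fromIntegral in mean is rejected at compile time, which is the kind of strictness described above.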

Link to code