Crime Incident Cluster Analysis
Authors: Jimmy Lin, David Bole Ma Code Source: https://github.com/Jimmy-Lin/k_means.hs
What is the problem?
We will be performing K-means cluster analysis over the geolocation of crime incidents in Vancouver using this dataset: http://data.vancouver.ca/datacatalogue/crime-data.htm
What is the something extra?
In the process of creating this cluster analysis program, we will also create a CSV reader and a generalize K-clustering algorithm.
Due to the nature of Haskell, we can write a generic CSV reader, then we can partially apply this function to select the data attributes relevant for our own purpose. This makes the code highly reusable for others who want to work with CSV data. Similarly we implemented a series of higher-order functions for K-clustering algorithms that allows you to specify your own vector norms/distance functions as well as centroid computation strategies (eg. Mean, Median, Mode, etc.) These abstractions allow others to easily extend the program for their own purposes.
For example: It would be trivial to modify the program to perform K-median clustering. We would only need to add a median and L1-norm function.
Further, the program is implemented as a script as opposed to an interactive program. That means it would be easy for another program to interact with it, parse the output and do further work with the results as part of another application.
What did we learn from doing this?
Haskell's paradigm treats state in it's own very restricted manner. For example, IO operations have to be confined in an IO monad. Our iterative K-means algorithm has to be reformed into a recursive equivalent. While this is possible, and helps better separate what and isn't part of the state of the algorithm, it can sometimes make implementation slower since many algorithms are still being described procedurally by convention.