Course:CPSC312-2023/kNN-iris
Authors: Joy, Jonathan
What is the problem?
The problem this Prolog project addresses is how to identify and categorize iris flowers from the iris dataset using the k-nearest neighbours (kNN) algorithm. The dataset contains measurements of the sepal length, sepal width, petal length, and petal width of 150 iris flowers belonging to three species: setosa, versicolor, and virginica. The goal is to develop a program that predicts the species of an iris flower from its measurements. Some example entries of the dataset are displayed below.
sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|
5.1 | 3.5 | 1.3 | 0.2 | Setosa |
4.9 | 3 | 1.4 | 0.2 | Setosa |
6 | 2.9 | 4.5 | 1.5 | Versicolor |
7.7 | 3 | 6.1 | 2.3 | Virginica |
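In Prolog, each row of the dataset can be stored as a fact. A minimal sketch, assuming a predicate `iris/5` (the name and representation are illustrative; the repository's actual encoding may differ):

```prolog
% iris(SepalLength, SepalWidth, PetalLength, PetalWidth, Species).
iris(5.1, 3.5, 1.3, 0.2, setosa).
iris(4.9, 3.0, 1.4, 0.2, setosa).
iris(6.0, 2.9, 4.5, 1.5, versicolor).
iris(7.7, 3.0, 6.1, 2.3, virginica).
```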
What is the something extra?
Our kNN algorithm allows the user to choose the number of nearest neighbours (K) to calculate for each prediction and outputs the predicted class, which is the most frequent species appearing in the nearest-neighbour list. It also prints the Euclidean distance of each nearest neighbour in the output list.
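The two core steps just described, computing Euclidean distances and keeping the K closest training points, can be sketched as below. All predicate names here are assumptions for illustration (we assume facts of the form `iris(SL, SW, PL, PW, Species)`), not necessarily those used in the repository:

```prolog
% Euclidean distance between two 4-feature points.
euclidean_distance([SL1,SW1,PL1,PW1], [SL2,SW2,PL2,PW2], D) :-
    D is sqrt((SL1-SL2)^2 + (SW1-SW2)^2 + (PL1-PL2)^2 + (PW1-PW2)^2).

% Collect Distance-Species pairs for every training point, sort them
% ascending by distance (msort keeps duplicates), and keep the first K.
nearest_neighbours(Query, K, Neighbours) :-
    findall(D-Species,
            ( iris(SL, SW, PL, PW, Species),
              euclidean_distance(Query, [SL,SW,PL,PW], D) ),
            Pairs),
    msort(Pairs, Sorted),
    length(Neighbours, K),
    append(Neighbours, _, Sorted).
```

The predicted class is then the most frequent species among the `Neighbours` list.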
We created 2 "extra" parts for this project. The first was writing code to compute the 5-fold cross-validation score of our kNN classifier to test its accuracy. For example, we can compute the CV score for any odd integer chosen as the kNN k value, and the program will report the average cross-validation score across the 5 folds. We chose 5 folds for partitioning the data because the iris dataset has 150 instances, so each test fold holds 30 instances, which we considered a very reasonable size.
For example, a call to

```prolog
cross_validation(K, Accuracy)
```

such as

```prolog
?- cross_validation(3, Accuracy).
```

will output the average CV accuracy score of the kNN algorithm using 3 nearest neighbours over a 5-fold cross-validation.
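The 5-fold partitioning described above (150 instances into folds of 30) can be sketched as follows; the predicate name is hypothetical and the repository's actual code may differ:

```prolog
% Split a (pre-shuffled) list of 150 rows into five folds of 30.
% Each fold is then used once as the test set while the other four
% folds form the training set.
partition_folds([], []).
partition_folds(Data, [Fold|Rest]) :-
    Data \= [],
    length(Fold, 30),
    append(Fold, Remaining, Data),
    partition_folds(Remaining, Rest).
```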
Additionally, we wrote unit tests using PLUnit. These can be run through the command line:

```prolog
?- [test].
?- run_tests.
```
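A PLUnit suite has the following shape. This demo suite is purely illustrative (it exercises built-ins rather than our project's predicates):

```prolog
:- begin_tests(lists_demo).

% A deterministic test: the goal must succeed exactly once.
test(sorted) :-
    msort([3,1,2], [1,2,3]).

% A nondeterministic test: member/2 leaves choice points, so the
% test is marked [nondet].
test(member_nondet, [nondet]) :-
    member(X, [setosa, versicolor]),
    X == setosa.

:- end_tests(lists_demo).
```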
What did we learn from doing this?
- Prolog is a powerful programming language that is well suited to implementing some machine learning algorithms. Although it lacks many of the built-in facilities found in Python's scikit-learn library (such as data preprocessing or PCA), being able to code the kNN model directly and explicitly in Prolog was very rewarding.
- kNN is a simple yet effective machine learning algorithm that can be used for classification and regression tasks; our model generally achieved a 96% cross-validation accuracy score with a neighbours value of 3.
- You can write predicates in Prolog that generate a list of solutions to a query, as the findall predicate does in our project. Another important takeaway is the use of Prolog's built-in predicates, such as findall, sort, and length, to perform operations on lists of data points.
- We also learned how to perform cross-validation to evaluate the performance of the model. It was interesting to compute the cross-validation accuracy score from scratch in Prolog without relying on built-in libraries such as Python's scikit-learn. Cross-validation is a widely used technique for evaluating machine learning models, as it provides a more accurate estimate of a model's performance than a simple train-test split. To make sure our model was performing correctly, we compared the cross-validation scores for different k values and folds from our Prolog project to those produced by the Python scikit-learn library. They matched up to minor differences due to the random shuffling of data when performing the splits, meaning that our kNN model works as expected.
- We also learned how to debug our program by adding print statements to helper predicates. For example, while creating the cross-validation test, a neighbour value of 1 kept yielding an accuracy score of 100% for every fold. Only when we printed out our train and test folds did we realize that we were calling kNN on the entire dataset rather than the training set when evaluating the scores, which made the accuracy scores incorrect. Thus, our "extra" of computing cross-validation scores exemplifies the importance of testing and evaluating machine learning models to ensure their effectiveness.
- Writing unit tests in Prolog was very similar to writing unit tests in Haskell, and both were easier than writing tests in more common programming languages such as Java or C++. Prolog's lack of control-flow features strongly encouraged us to write smaller predicates and rules that each had a single responsibility. These predicates are very similar to pure functions in functional programming: they don't modify anything in the global environment, so we only have to test what the predicate gives us. Initially we thought there would be more difficulty since a query can have more than one answer, but PLUnit has a built-in option called [nondet] that lets a test succeed even when choice points remain. (See our count_occurrences_test test suite.)
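As an example of the small, single-responsibility predicates mentioned above, here is a hypothetical shape for a counting predicate and its nondeterministic test; the actual `count_occurrences` in the repository may differ:

```prolog
% Count how many times X occurs in a list.
count_occurrences(_, [], 0).
count_occurrences(X, [X|T], N) :-
    count_occurrences(X, T, N0),
    N is N0 + 1.
count_occurrences(X, [Y|T], N) :-
    X \= Y,
    count_occurrences(X, T, N).

:- begin_tests(count_occurrences_test).

% The clause ordering leaves a choice point, hence [nondet].
test(counts_species, [nondet]) :-
    count_occurrences(setosa, [setosa, versicolor, setosa], 2).

:- end_tests(count_occurrences_test).
```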
Links to code etc.
https://github.com/zhaoqiaojin/KnnIris