# Course:CPSC522/Alternative Classifiers

## Alternative Classifiers

This page showcases three alternatives to linear classification, namely the Euclidean, the Laplacian and the Mahalanobis classifiers.

Principal Author: Peyman Bateni

## Abstract

In this page, we challenge the widespread use of the linear classifier in deep learning models by proposing three alternative classifiers to be used in its place. Our classifiers take advantage of the Euclidean distance, the absolute difference, and the Mahalanobis distance for classification, and are able to match and marginally exceed the linear classifier in performance. We additionally derive a theoretical correspondence between our classifiers and more statistically sound mixture models, hence providing a theoretical motivation for using our classifiers and laying the groundwork for further theoretical research using the proposed approaches.

### Builds on

This page explores a fairly low-level hypothesis and therefore does not directly build on any of the existing pages.

### Related Pages

More broadly, this page relates to existing work on deep learning with a particular focus on classification.

## Content

### Introduction and Motivation

Visual classification can be effectively described in two steps: distilling images into feature vectors, and then using such feature vectors to assign labels. While the former has received extensive attention in past research [1, 2], especially with the invention of Convolutional Neural Networks (CNNs), the latter has often been neglected. That is, with a few exceptions, models have primarily relied on the widely used linear classifier to perform the final labelling stage. This stems from the underlying belief that a powerful enough feature extractor would be able to approximate any classification metric in the feature extraction process, therefore removing the need for augmenting such a process in the second stage. However, recent work in visual classification with limited labelled training data has challenged this belief by showing that the choice of the second-stage classifier is of crucial importance in obtaining optimal performance [3, 4]. While operating with limited labelled data comes with inherently different challenges and concerns, it poses the question of whether linear classifiers should also be re-thought in fully supervised visual classification where labelled data is abundantly available. This provides the motivation behind the work below, proposing three new alternatives to the linear classifier. Our contributions specifically are as follows:

• Three classifiers based on the Euclidean, absolute, and Mahalanobis distance functions that are able to match the state of the art in performance.
• A theoretical derivation of the correspondence between the proposed classifiers and mixture models with different distribution functions.

### Problem Definition and Notation

We consider the problem setting where a large set of labelled images ${\displaystyle D_{train}=\{(\mathbf {x} _{i},\mathbb {y} _{i})\}_{i=1}^{n}}$ is available, consisting of ${\displaystyle n}$ labelled images that can be used for training our classifier. Similarly, a labelled test set ${\displaystyle D_{test}=\{(\mathbf {x} _{i},\mathbb {y} _{i})\}_{i=n+1}^{n+m}}$ of ${\displaystyle m}$ different images is used to evaluate the performance of the model. At a higher level, it's assumed that both these sets are sampled from a true underlying data distribution for the task at hand (i.e. ${\displaystyle D_{train},D_{test}\in D}$). Therefore, the goal is fundamentally to maximise ${\displaystyle \mathbb {E} _{(\mathbf {x} ,\mathbb {y} )\sim D}[p(\mathbb {y} ^{*}=\mathbb {y} |\mathbf {x} ,\theta )]}$ such that for any image ${\displaystyle \mathbf {x} }$, the likelihood of the true label ${\displaystyle \mathbb {y} }$ is maximised. In practice, this is done by learning ${\displaystyle \theta }$ using an approximation of the aforementioned expectation formulated as a negative log-likelihood loss, plus a regularization term to avoid over-fitting to the training data.
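As a minimal sketch (not the page's own code), the negative log-likelihood objective over a batch can be written as follows, assuming the classifier has already produced per-class probabilities:

```python
import numpy as np

def nll_loss(probs, labels):
    """Negative log-likelihood averaged over a batch.

    probs:  (n, K) array of class probabilities p(y* = k | x, theta)
    labels: (n,) array of true class indices
    """
    n = probs.shape[0]
    # Pick out the probability assigned to each example's true label.
    return -np.mean(np.log(probs[np.arange(n), labels]))

# A perfectly confident correct prediction contributes zero loss.
probs = np.array([[1.0, 0.0],
                  [0.25, 0.75]])
labels = np.array([0, 1])
loss = nll_loss(probs, labels)  # = -(log 1 + log 0.75) / 2
```

In a real training loop this term would be combined with a regularizer (e.g. weight decay) as the text describes.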

We additionally define our model to consist of two parts: a feature extractor ${\displaystyle f_{\theta }}$ that, for an image ${\displaystyle \mathbf {x} }$, produces a feature vector ${\displaystyle \mathbf {z} }$, and a classifier that, using the extracted feature vector for an image, produces the probability of the example belonging to each class in the task. More specifically, the classifier produces ${\displaystyle p(\mathbb {y} ^{*}=k|\mathbf {z} ,\theta )}$ for each class ${\displaystyle k}$ within the task. Note that while our focus here is on visual classification, this framework is often used in other classification tasks such as document classification, topic classification, etc. Furthermore, note that our formulation of the problem is agnostic towards the architecture of the feature extractor ${\displaystyle f_{\theta }}$, making our proposed alternative classifiers applicable to designs other than the ones experimented with here.

### A New Take on Linear Classification

A linear classifier learns a set of linear weights ${\displaystyle W}$ and biases ${\displaystyle \mathbf {b} }$, which are used to project the input feature vector, followed by a non-linearity, namely the Softmax function, to produce the final probabilities. That is, to obtain all class probabilities for a feature vector ${\displaystyle \mathbf {z} }$, the following is performed:

${\displaystyle p(\mathbb {y} ^{*}|\mathbf {z} ,\theta )={\text{Softmax}}(W\times \mathbf {z} +\mathbf {b} )}$

In this way, a linear classifier can be thought of as a one-layer neural approximation to a classification function. However, it can also be thought of as a different procedure. Ignoring the biases for now, we can observe that the linear classifier relies on learning a class representation ${\displaystyle W[k,:]}$ for each class. These class representations act as embeddings in the feature space that can be used to classify new feature vectors using some metric, which in this case, given the matrix multiplication between ${\displaystyle W}$ and ${\displaystyle \mathbf {z} }$, is the dot product as a similarity metric. This, therefore, gives rise to the following alternative formulation of linear classification where, for feature vector ${\displaystyle \mathbf {z} }$, the likelihood of the example belonging to class ${\displaystyle k}$ is defined by:

${\displaystyle p(\mathbb {y} ^{*}=k|\mathbf {z} ,\theta )={\text{Softmax}}(W[k,:]\cdot \mathbf {z} +\mathbf {b} [k])}$

which is effectively equivalent to the following:

${\displaystyle p(\mathbb {y} ^{*}=k|\mathbf {z} ,\theta )={\text{Softmax}}(-d_{k}(\mathbf {z} ,\mathbf {\mu } _{k}))}$

where in the case of the linear classifier, ${\displaystyle d_{k}(\mathbf {z} ,\mathbf {\mu } _{k})=-\mathbf {\mu } _{k}\cdot \mathbf {z} -b[k]}$ and ${\displaystyle \mathbf {\mu } _{k}=W[k,:]}$. Given this formulation, a natural question arises: can we choose a better distance function and class representation? In the next section, we propose three alternatives to answer this very question.
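A small numerical sketch (assuming NumPy; illustrative rather than the page's code) confirming that the per-class dot-product view produces exactly the same probabilities as the usual matrix formulation:

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # 3 classes, 4-dimensional features
b = rng.normal(size=3)
z = rng.normal(size=4)

# Standard formulation: Softmax(W z + b)
p_matrix = softmax(W @ z + b)

# Per-class view: logit for class k is W[k, :] . z + b[k], i.e. -d_k(z, mu_k)
logits = np.array([W[k, :] @ z + b[k] for k in range(3)])
p_perclass = softmax(logits)
```

The two vectors `p_matrix` and `p_perclass` agree elementwise, since matrix multiplication is just a row-wise collection of dot products.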

### Alternative Classifiers

#### The Euclidean Classifier

A natural choice for the distance function is the squared Euclidean distance, thanks to its widespread use as a regression loss itself. Specifically, this variation performs classification using ${\displaystyle p(\mathbb {y} ^{*}=k|\mathbf {z} ,\theta )={\text{Softmax}}(-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{k})^{2})}$ where ${\displaystyle \mathbf {\mu } _{k}}$ is a learned parameter initialised for each class ${\displaystyle k}$ by sampling from a uniform distribution of random noise. Note that in this fashion, the classification procedure is end-to-end differentiable and hence we can learn the class parameters and the feature extractor together. If you are wondering about the presence of the ${\displaystyle {\frac {1}{2}}}$ within the classifier: this addition has very minimal impact on the performance of the classifier itself. However, it allows us to show a direct correspondence between using the squared Euclidean distance as a metric inside a Softmax function and a multi-variate Gaussian Mixture Model with the prior ${\displaystyle \pi _{k}={\frac {1}{K}}}$ for each class ${\displaystyle k}$ of the total of ${\displaystyle K}$ classes in the task, with unit covariance assumed for each Gaussian distribution.

${\displaystyle p(\mathbb {y} ^{*}=k|\mathbf {z} ,\theta )={\text{Softmax}}(-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{k})^{2})={\frac {e^{-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{k})^{2}}}{\sum _{c\in \{1..K\}}e^{-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{c})^{2}}}}={\frac {(2\pi )^{-{\frac {l}{2}}}e^{-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{k})^{2}}}{\sum _{c\in \{1..K\}}(2\pi )^{-{\frac {l}{2}}}e^{-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{c})^{2}}}}={\frac {{\mathcal {N}}(\mathbf {z} |\mu _{k},I)}{\sum _{c\in \{1..K\}}{\mathcal {N}}(\mathbf {z} |\mu _{c},I)}}}$

This unit-covariance assumption is represented by the identity matrix used as the covariance estimate for each Gaussian in the last part of the equation above. This simplifying assumption results in a further correspondence between the use of the squared Euclidean distance for classification and density estimation as a parallel task. This is naturally a more theoretically sound choice than the dot product used in the linear classifier, since the learned class representations can now be treated as class means, with a decision space that possesses the properties of a Gaussian Mixture Model (GMM).
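The correspondence above can be checked numerically. The sketch below (assuming NumPy; not the page's code) verifies that the Softmax over negative halved squared Euclidean distances equals the posterior of a uniform-prior, unit-covariance Gaussian mixture:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def gaussian_density(z, mu):
    """Multivariate Gaussian density with identity covariance."""
    l = z.shape[0]
    d2 = np.sum((z - mu) ** 2)
    return (2 * np.pi) ** (-l / 2) * np.exp(-0.5 * d2)

rng = np.random.default_rng(1)
mus = rng.normal(size=(3, 4))  # one class mean per class
z = rng.normal(size=4)

# Euclidean classifier: Softmax over negative halved squared distances
p_euclid = softmax(np.array([-0.5 * np.sum((z - mu) ** 2) for mu in mus]))

# GMM posterior with uniform priors and unit covariance
dens = np.array([gaussian_density(z, mu) for mu in mus])
p_gmm = dens / dens.sum()
```

The `(2*pi)^(-l/2)` normalizer cancels in the ratio, which is exactly why it can be inserted into the derivation for free.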

#### The Laplacian Classifier

Another choice is the absolute difference used as a measure of distance to perform classification. Arguably second only to the squared Euclidean error in popularity, the absolute error is also often used as a regression loss. Unlike the squared Euclidean distance, it is more robust towards outliers, but unfortunately it is not differentiable at a distance of zero. In practice, this is not a problem, as it's incredibly rare for a training example to lie exactly on the class mean. We can therefore perform classification in this new framework by simply producing class probabilities using ${\displaystyle p(\mathbb {y} ^{*}=k|\mathbf {z} ,\theta )={\text{Softmax}}(-|\mathbf {z} -\mathbf {\mu } _{k}|)}$. Similar to the previous section, the use of the absolute difference in the Laplacian classifier, as the name implies, corresponds to performing classification in a mixture model parameterized by Laplacian distributions as opposed to Gaussians, as shown below:

*Figure: Comparison of Laplacian and Gaussian distributions. Figure from ResearchGate, created by Mohsen Jeidi.*

${\displaystyle p(\mathbb {y} ^{*}=k|\mathbf {z} ,\theta )={\text{Softmax}}(-|\mathbf {z} -\mathbf {\mu } _{k}|)={\frac {e^{-|\mathbf {z} -\mathbf {\mu } _{k}|}}{\sum _{c\in \{1..K\}}e^{-|\mathbf {z} -\mathbf {\mu } _{c}|}}}={\frac {{\frac {1}{2}}e^{-|\mathbf {z} -\mathbf {\mu } _{k}|}}{\sum _{c\in \{1..K\}}{\frac {1}{2}}e^{-|\mathbf {z} -\mathbf {\mu } _{c}|}}}={\frac {{\text{Laplace}}(\mathbf {z} |\mu _{k},1)}{\sum _{c\in \{1..K\}}{\text{Laplace}}(\mathbf {z} |\mu _{c},1)}}}$

A comparison of Laplacian and Gaussian distributions has been provided in the figure above.
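Analogously to the Euclidean case, the derivation can be checked numerically. This sketch (assuming NumPy; illustrative only) verifies that the Softmax over negative L1 distances equals the posterior of a uniform-prior mixture of unit-scale Laplace distributions:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def laplace_density(z, mu):
    """Product of independent unit-scale Laplace densities, one per dimension."""
    return np.prod(0.5 * np.exp(-np.abs(z - mu)))

rng = np.random.default_rng(2)
mus = rng.normal(size=(3, 4))
z = rng.normal(size=4)

# Laplacian classifier: Softmax over negative L1 distances
p_lap = softmax(np.array([-np.sum(np.abs(z - mu)) for mu in mus]))

# Posterior under a mixture of unit-scale Laplace distributions, uniform priors
dens = np.array([laplace_density(z, mu) for mu in mus])
p_mix = dens / dens.sum()
```

The per-dimension 1/2 normalizers cancel in the ratio, just as the Gaussian normalizer did in the Euclidean derivation.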

#### The Mahalanobis Classifier

The squared Euclidean distance makes the simplifying assumption that class clusters are unit-covariant and de-correlated, as enforced by the identity matrix used in place of the covariance matrix in the Gaussian distributions as previously shown. This assumption can be relaxed using the squared Mahalanobis distance, where a covariance estimate is learned for each cluster and used as a whitening transformation in the middle of the distance function. More specifically, the use of the squared Mahalanobis distance turns the production of class probabilities into ${\displaystyle p(\mathbb {y} ^{*}=k|\mathbf {z} ,\theta )={\text{Softmax}}(-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{k})^{T}Q_{k}^{-1}(\mathbf {z} -\mathbf {\mu } _{k}))}$ where ${\displaystyle Q_{k}}$ is the covariance estimate for class ${\displaystyle k}$. Note that when all ${\displaystyle Q_{k}}$ are set to the identity, the classifier becomes identical to the Euclidean classifier. Note also that this method of classification approximately corresponds to a full Gaussian Mixture Model (GMM), as shown below:

${\displaystyle p(\mathbb {y} ^{*}=k|\mathbf {z} ,\theta )={\text{Softmax}}(-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{k})^{T}Q_{k}^{-1}(\mathbf {z} -\mathbf {\mu } _{k}))\approx {\text{Softmax}}(-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{k})^{T}Q_{k}^{-1}(\mathbf {z} -\mathbf {\mu } _{k})+{\text{log}}({\text{Det}}(Q_{k})^{-{\frac {1}{2}}}))={\frac {e^{-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{k})^{T}Q_{k}^{-1}(\mathbf {z} -\mathbf {\mu } _{k})+{\text{log}}({\text{Det}}(Q_{k})^{-{\frac {1}{2}}})}}{\sum _{c\in \{1..K\}}e^{-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{c})^{T}Q_{c}^{-1}(\mathbf {z} -\mathbf {\mu } _{c})+{\text{log}}({\text{Det}}(Q_{c})^{-{\frac {1}{2}}})}}}={\frac {(2\pi )^{-{\frac {l}{2}}}{\text{Det}}(Q_{k})^{-{\frac {1}{2}}}e^{-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{k})^{T}Q_{k}^{-1}(\mathbf {z} -\mathbf {\mu } _{k})}}{\sum _{c\in \{1..K\}}(2\pi )^{-{\frac {l}{2}}}{\text{Det}}(Q_{c})^{-{\frac {1}{2}}}e^{-{\frac {1}{2}}(\mathbf {z} -\mathbf {\mu } _{c})^{T}Q_{c}^{-1}(\mathbf {z} -\mathbf {\mu } _{c})}}}={\frac {{\mathcal {N}}(\mathbf {z} |\mu _{k},Q_{k})}{\sum _{c\in \{1..K\}}{\mathcal {N}}(\mathbf {z} |\mu _{c},Q_{c})}}}$

The latter term, namely the log-determinant, was found to introduce faulty gradient signals at the beginning of training, making optimisation unstable. It was therefore excluded, and experiments were conducted using the base Mahalanobis variation. Similar to the learning of the means for the previous two classifiers, the covariance matrices can be initialised here as parameters and learned on the fly. However, initialising them with random noise or a zero matrix essentially renders the model unable to learn anything effective. Instead, we initialise the covariance estimates using the identity matrix, so that the classifier resembles the Euclidean classifier in the early stages of training.
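A minimal sketch of the Mahalanobis classifier (assuming NumPy; not the authors' training code), also checking the claim that identity-initialised covariances reduce it to the Euclidean classifier:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def mahalanobis_probs(z, mus, Qs):
    """Softmax over negative halved squared Mahalanobis distances."""
    d = np.array([0.5 * (z - mu) @ np.linalg.inv(Q) @ (z - mu)
                  for mu, Q in zip(mus, Qs)])
    return softmax(-d)

rng = np.random.default_rng(3)
mus = rng.normal(size=(3, 4))
z = rng.normal(size=4)

# Identity initialisation: the classifier reduces to the Euclidean one
Qs = [np.eye(4) for _ in range(3)]
p_mahal = mahalanobis_probs(z, mus, Qs)
p_euclid = softmax(np.array([-0.5 * np.sum((z - mu) ** 2) for mu in mus]))
```

In a deep learning framework, `mus` and `Qs` would be trainable parameters updated alongside the feature extractor; explicitly inverting `Q` at every step is shown here only for clarity.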

### Experiments

We perform experiments on the CIFAR10 and CIFAR100 datasets that respectively consist of 10 and 100 classes of images. We used a ResNet20 [2] as the feature extractor under two settings, one initialised and learned from scratch and the other initialised using pre-trained weights. The results are shown below:

Comparing accuracies (%) of different classifiers on CIFAR10 and CIFAR100:

| Dataset | ResNet Initialisation | Linear Classifier | Euclidean Classifier | Laplacian Classifier | Mahalanobis Classifier |
|---|---|---|---|---|---|
| CIFAR10 | Scratch | 91.51 | 92.08 | 91.83 | 89.66 |
| CIFAR10 | Pretrained | 85.79 | 84.55 | 85.45 | 62.86 |
| CIFAR100 | Scratch | 67.29 | 67.95 | 66.18 | NA |
| CIFAR100 | Pretrained | 55.99 | 55.3 | 59.88 | NA |

As shown, the Euclidean classifier and the Laplacian classifier are able to match, and in some cases marginally outperform, the linear classifier. This is very interesting, as in low-data settings the performance gap is actually large, but it seems that when given enough data, the feature space can separate well enough for these small differences not to matter. However, the theoretical properties are still very useful, and the matching performance can potentially motivate the usage of those two as alternatives to the linear classifier. The Mahalanobis classifier, however, lags behind in performance. It seems that learning a covariance estimate for every class while training the feature extractor at the same time can be unstable and therefore sub-optimal in performance. Due to time limitations, and the fact that the Mahalanobis classifier proved to lag in performance on CIFAR10, CIFAR100 experiments were not performed with that classifier. Additionally, as perhaps expected, using pre-trained initialisation generally lowers performance. That said, since fine-tuning with pre-trained weights can be very unstable, especially in this case where the weights come from a model pre-trained on ImageNet that itself uses a linear classifier, we see a larger loss in performance for the Mahalanobis classifier compared to the other three when comparing "from scratch" training to initialisation with pre-trained weights.

### Future Work

We hope that this page serves as a motivating element for future work on alternative classifiers, including those described here. Furthermore, research into better end-to-end learning of covariance-based classifiers can be potentially useful in the future. A potential avenue in this direction could include iterative learning of covariance and mean estimates using their respective closed-form definitions, which would guarantee the theoretical properties of a covariance matrix (such as being positive semi-definite) while still allowing end-to-end training of the model.
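One possible shape of the closed-form approach suggested above, sketched here as an assumption rather than an implemented method, is to estimate each class's mean and covariance directly from the batch features, with a small identity regularizer keeping the estimate positive definite:

```python
import numpy as np

def class_estimates(features, labels, num_classes, eps=1e-3):
    """Closed-form per-class mean and regularised covariance estimates.

    Adding eps * I keeps each covariance positive definite even when a
    class has few examples in the batch. (Hypothetical helper, not from
    the page.)
    """
    dims = features.shape[1]
    means, covs = [], []
    for k in range(num_classes):
        fk = features[labels == k]
        mu = fk.mean(axis=0)
        diff = fk - mu
        cov = diff.T @ diff / max(len(fk) - 1, 1) + eps * np.eye(dims)
        means.append(mu)
        covs.append(cov)
    return np.stack(means), np.stack(covs)

rng = np.random.default_rng(4)
feats = rng.normal(size=(20, 4))
labs = np.array([0] * 10 + [1] * 10)
mus, Qs = class_estimates(feats, labs, num_classes=2)
```

Because these estimates are differentiable functions of the features, gradients can still flow back into the feature extractor, which is what makes this compatible with end-to-end training.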

## Annotated Bibliography

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[3] P. Bateni, R. Goyal, V. Masrani, F. Wood and L. Sigal. Improved Few Shot Learning. arXiv preprint arXiv:1912.03432, 2019.

[4] J. Snell, K. Swersky, and R. S. Zemel. Prototypical Networks for Few-shot Learning. arXiv preprint arXiv:1703.05175, 2017.