# Course:CPSC522/Online Pattern Analysis by Evolving Self-Organizing Maps

## On-line pattern analysis by evolving self-organizing maps

The evolving self-organizing map enhances the original self-organizing map by taking inspiration from the work on growing neural gas, which explores an implementation of a model that learns topology of the input without a predetermined output model size.

1. Fritzke, Bernd (January 1994). "A Growing Neural Gas Model Learns Topologies" (PDF). Proceedings of the 7th International Conference on Neural Information Processing Systems: 625–632.
2. Deng, Da; Kasabov, Nikola (April 2003). "On-line pattern analysis by evolving self-organizing maps". Neurocomputing. 51: 87–103.

Principal Author: Svetlana Sodol

## Abstract

This page describes two papers that propose models that incrementally grow in size the output models they produce - growing neural gas and evolving self-organizing map. The latter is inspired by the first and the papers presented are separated in time by almost 10 years. The evolving self-organizing map borrows the basic structure and approach of the growing neural gas while introducing changes like making insertion of new nodes faster and keeping track of a smaller number of model parameters. The evolving self-organizing map work also provides more context for usage of such an incremental learning model in an online environment for classification and clustering.

### Builds on

The concepts in this page are based on models and algorithms of SOM, neural gas, competitive Hebbian learning and vector quantization.

### Related Pages

Some algorithms that are similar in nature or tasks are PCA, k-means and neural networks.

## Content

### Introduction

The paper "On-line pattern analysis by evolving self-organizing maps" attempts to solve the problem of processing online data streams. This could be a classification or a clustering task of a dataset that is presented to the model example by example, not all together. The inputs and outputs might come from unknown distributions and change over time. This means that the parameters of any model need to be updated incrementally with limited memory resources. Authors point out that this requires efficiency and flexibility and thus many of the models that suit these tasks are neural network models. Previously used models are critiqued as having predetermined sizes so there is open demand for models that do not need their sizes predetermined before any data is seen. An evolving model titled "growing neural gas" (GNG) has been proposed by Fritzke[1] and was based on the model of the "neural gas" (NG) , which is a very similar concept to the SOM. This showed that fast incremental learning of not only the structure of the output model but also the size of it, is possible and so the growing models of Fritzke are taken as the inspiration to the evolving self-organizing map (ESOM)[2]. The fact that these algorithms all go through the input examples in succession allows this to be used for streaming or online settings. Although this does not limit the usage to only the online realm, as any dataset can usually be streamed as was done for many benchmarks that will be described.

In the following sections, first, the initial paper by Fritzke on the growing neural gas will be described. Then, the more recent paper on the evolving SOM's. The last section will outline some of the comparisons and critiques of the two papers, as well as provide a bigger picture on the contributions.

### Growing Neural Gas[1]

#### Introduction

"A growing neural gas model learns topologies" paper presents an unsupervised learning approach to solving dimensionality reduction. The main idea of dimensionality reduction is described by the authors as "finding low dimensional subspace of the input vector space containing most or all of the input data”.

Some already existing solutions are to use direct methods like PCA or iterative methods like self-organizing maps (SOM)[3] and growing cell structures (GCS)[4]. However the problem with these methods is that the dimension size of the output feature map for SOM or GCG (or the number of principle components for PCA) have to be specified ahead of time. There has been another method titled "neural gas" (NG) described in 1991, that uses competitive Hebbian learning (CHL) to find topological structure that closely reflects how the input data is distributed. This paper proposes a new network model that also uses CHL and is based on NG, but is incremental and only has constant parameters, thus offering advantages over other models.

#### Background

The competitive Hebbian learning (CHL) principle is “for each input signal x connect the two closest centres (measured by Euclidean distance) by an edge”. The centres here are like the cluster centres in k-means - they form the actual output model that we want to produce, to represent the distribution of the input data.

We start by assuming a certain number of centres in ${\displaystyle \mathbb {R} ^{n}}$ and then iteratively insert topological connections - edges of the resulting network. The resulting graph optimally preserves topology in a very general sense. This is titled "induced Delaunay triangulation". Since the resulting graph is limited to where the input distribution is greater than zero, for most usefulness we should only place the centres at these locations by using vector qunatization (VQ).

The neural gas (NG) method is a form of VQ. The principle is as follows: “for each input signal x adapt the k nearest centers whereby k is decreasing (through time) from a large initial to a small final value”. This allows to move the nodes of the resulting network closer to the places of interest where the input data is actually coming from.

NG and CHL can be run in succession or concurrently, as although CHL does not influence NG, NG influences CHL. Where we have the centres matters for which centres we find to be the closest to the input data and therefore which we connect to form the topology. Although this has been shown effective in topology learning, in practice the problem occurs with knowing the correct number of centres ahead of time in order to have accurate learning.

#### Algorithm

In order to avoid specifying the final number of centres for the model, the authors introduce the "growing neural gas" algorithm, that successively adds new centres to an initially small network by evaluating local statistics. This is a similar approach to GCS[4] but with dimensions dependent on input, which may vary locally.

Input: possibly infinite n-dimensional data set of a certain distribution

Output: a network that consists of set of nodes ${\displaystyle A}$ (the centers), each node has reference vector in ${\displaystyle \mathbb {R} ^{n}}$ space, and ${\displaystyle N}$ unweighted edges among pairs of nodes – to define the topological structure

Some of the parameters that are important for the algorithm:

- "ages" for the connections that increment with each iteration the edge and its nodes have not been used and delete the edges if they are older than ${\displaystyle a_{max}}$

- every node has its reference vector and an accumulated error

- ${\displaystyle \lambda }$parameter that controls how often we add new nodes to the network

- parameters ${\displaystyle e_{b}}$ (for the winner node) and ${\displaystyle e_{n}}$ (for the winner's neighbours) that control how these nodes' reference vectors are moved towards the incoming data

- ${\displaystyle \alpha }$ parameter that controls how the new nodes' errors are initialized

- ${\displaystyle d}$ parameter that controls how much the accumulated error for all nodes decreases with each iteration

Algorithm:

1. Start with 2 nodes at random positions in ${\displaystyle \mathbb {R} ^{n}}$ space
2. Sample input distribution for an example ${\displaystyle x}$
3. Find the nearest 2 nodes to the input example - ${\displaystyle s_{1}}$ and ${\displaystyle s_{2}}$
4. Increment age of all edges from ${\displaystyle s_{1}}$
5. Update error accumulator for ${\displaystyle s_{1}}$: ${\displaystyle \Delta \;error(s_{1})=||s_{1}-x||_{2}}$
6. Move ${\displaystyle s_{1}}$ and all its neighbors towards input example by ${\displaystyle \Delta \;node=e*||node-x||_{2}}$, where the parameter ${\displaystyle e}$ is ${\displaystyle e_{b}}$ for ${\displaystyle s_{1}}$ and ${\displaystyle e_{n}}$ for rest
7. If ${\displaystyle s_{1}}$ and ${\displaystyle s_{2}}$ are connected by an edge, set age to 0, otherwise create it
8. Remove edges with age larger than ${\displaystyle a_{max}}$, if nodes have no edges delete them
9. Every ${\displaystyle \lambda }$ input examples, add new node: Determine node ${\displaystyle q}$ with largest accumulator error; Insert new node ${\displaystyle r}$ halfway between node ${\displaystyle q}$ and its highest error neighbor node ${\displaystyle f}$; ${\displaystyle r=0.5*(q+f)}$; Connect ${\displaystyle r}$ to ${\displaystyle q}$ and ${\displaystyle f}$, remove original edge between ${\displaystyle q}$ and ${\displaystyle f}$; ${\displaystyle error(r)=error(q)=\alpha *q;error(f)=\alpha *f}$
10. Decrease all errors by multiplying with parameter ${\displaystyle d}$
11. Repeat from step 2 until done

#### Simulation Results

This GNG model was tested on a dataset used for original NG and the results are that GNG quickly learns important data topology structures.

This test had 600-20000 inputs and the resulting network had 100 edges between nodes.

The 2nd test compared GNG to NG. The final topology learnt was similar, with both models having 100 nodes after 10,000 steps. However, intermediate networks were different. In the image to the right you can see the snapshots of this evaluation at different stages of learning. The left-hand column is the NG model and the right-hand is the GNG. The input data structure is the big cirles and the block in the centre, while the small cirles represent the model nodes and the lines are the connections. One can see how the GNG model can also take the shape of the input data distribution through time while increasing the number of used nodes. This image is taken from the paper by Fritzke[1].

This figure shows how the neural gas and the growing neural gas models imitate the structure of the input data

### Evolving Self-Organizing Maps[2]

Here we describe the second paper "On-line pattern analysis by evolving self-organizing maps". The structure of the ESOM network is as in the GNG. It builds on the idea that learning the size of the topology network can be done incrementally. The main contribuion of this paper is the implementation of the evolving topology network and showing emperical results that this is competitive to other clustering and classification methods. It is importnant to note that the evaluations are not actually performed "on-line", the datasets that are used are fixed and are well-exlored. However, the datasets are treated as if they are online since they are provided to the model one by one.

#### ESOM

It starts with an empty network embedded in input space, new nodes are created incrementally and with each new input its 2 winner nodes get connected. If no nodes match with current input a new node is created.

The insertion of new nodes is different from the GNG to reduce the number of iterations needed to tune the new node to become a new cluster - instead of the insertion at halfway point, the new node is created at exactly the poorly matched input value.

To prevent this from introducing noise and artifacts, careful automatic deletion of obsolete nodes is completed. The main goal of this model is life-long learning, so strict convergence is not critical. The learning rate ${\displaystyle \gamma }$ is set constant, usually of 0.05, the forgetting constant ${\displaystyle \beta }$ (from Hebbian learning with forgetting) to 0.8 and the distance threshold ${\displaystyle \epsilon }$ that controls node insertion and thus cluster formation is determined by trial and error. This constant represents the maximum difference between a node and an input that can qualify it as a match. Another change from GNG is that there is no longer tracking of errors for all the nodes.

Algorithm (some of the variable names are changes to indicate the similarities to the GNG algorithm more explicitly):

1) Sample new input data vector ${\displaystyle x}$

2) If there are no nodes go to INSERTION

3) MATCHING: find 2 best-matching nodes ${\displaystyle s_{1}}$ and ${\displaystyle s_{2}}$, if ${\displaystyle ||s_{1}-x||<\epsilon }$go to UPDATING, otherwise to INSERTION

4) INSERTION: create new node representing the input, connect with 2 best matching existing nodes

5) UPDATING: modify winners ${\displaystyle s_{1}}$and ${\displaystyle s_{2}}$and their neighbors that have positive connection edges with the learning rule of ESOM:

${\displaystyle \Delta node_{i}=\gamma {\frac {a_{i}}{\sum _{k}a_{k}}}(x-node_{i})}$

where ${\displaystyle a_{i}=exp(-2||x-node_{i}||^{2/\epsilon ^{2}})}$

and update their edge connection strengths by:

${\displaystyle s(i,j)=\beta s(i,j)+(1-\beta )a_{i}a_{j}}$

6) DELETION: every ${\displaystyle \lambda }$ steps prune the edge with the weakest strength

7) Repeat from step 1 until there is no more data

The time complexity of this is comparable to GCS/GNG and is more efficient than original NG.

#### Simulations

ESOM is shown to be competitive with other approaches to clustering and classification.

4 different simulations were performed to evaluate the ESOM model.

1) Colour image quantization as online data clustering

The goal of this task is to reduce number of colours in a digital colour image, this is closely related to image compression. The problem is usually that this is time-consuming, and so an "online" solution would perform faster. ESOM was compared to the two basic approaches to this task as well as to k-means and Wu's method on 3 images popular in the field. ESOM had the best results for 2 images and had close performance to Wu's method on the last one. It was also noted that ESOM-produced images have better visual quality.

2) 2-spiral coordinate data classification

ESOM is unsupervised, but can be applicable in supervised setting for classification using the 1 nearest neighbor approach. In the 2-spiral data set the inputs are coordinates. ESOM was shown to need the fewest number of centres to get 0% error rate (about 100 centres), and required only a single pass through data set. It was compared to GCS, LVQ and SOM.

3) UCI Pima Indian diabetes set

The data in this benchmark is personal patient data and a binary classification indicating if they are positive for diabetes or not. Training was done with 10-fold cross validation, with data presented 5 times in random order, and average performance was compared. Compared to k-nearest neighbors, class/regression trees, MLP, LVQ and LDA, ESOM had pretty similar results. As compared to GNG/GCS, the ESOM only required one pass through the data for similar accuracy, however, GNG needed less nodes (but more passes through the data).

4) CMU vowels data set

The input data for this set are 990 frames of speech signals, from 4 males and 4 females. The training was performed offline to set the parameters. It was then done online and compared to perceptrons, nearest neighbours methods and GGS. ESOM required the least number of weights for similar performance, and only single pass learning.

#### Conclusions

ESOM can be extended to continue training on-line (as new examples keep coming in): authors predict that this will reduce the number of errors on diabetes and the vowels sets. Better or similar performance was shown on the benchmarks, with many advantages like single-pass learning. Compared to regular SOM, the results are better, with less time needed to train, and more visual clues are present in the resulting network, as better clustering/proximity is found. However, visualization is not usually straight forward as the output is in the original input space.

Future directions are outlined to be extending ESOM for use in information retrieval through threshold scales - doing hierarchical feature maps, using aggregation to reduce the number of nodes and self-tuning of system parameters.

Possible applications are multimedia information retrieval and online financial data analysis.

### Discussion

#### Contributions

The authors of the "On-line pattern analysis by evolving self-organizing maps" paper are very explicit about the similiarities and differences with the GNG model and algorithm, although they use sometimes different terminology and notation. The main idea is to use the concept that online learning needs incremental change in the network and the fact that this is shown to work with NG already. So they try a very similar thing with a SOM, to increase its fit to the online learning domain, where the input data is not present for pre-configuring of the model space. This is not a very straight forward or easy extension, as there are many changes and they also talk a lot about NG and GCS and some other models that share some aspects with ESOM. It seems like the authors are putting together little things from many already explored and implemented places and apply to a somewhat new setting.

#### Evaluations

For the evaluation of the ESOM model, some results to compare to are taken from other papers, not reproduced for this study. This is a weak point about the evaluation, as there might be some important details missing in the other implementations that could impact these comparisons.

However, ESOM is tested in much more detail than as the GNG presented in "A Growing Neural Gas Model Learns Topologies", with more sets, compared with more algorithms, on more diverse data. This might be the result of more work being done in this field and simply more available data available that is freely accesible and has already became a benchmark in the field.

#### Improvements

For the first paper there are still many unanswered questions. At one point the authors mention that "reversible mapping does not exist in general, so some information can be lost"[1]. They go on to talk about topology learning from the persepctive of finding out what structures can be mapped reversably. This is however not brought up into the discussion again and remains unswered. This, however, is very important for talking about accuracy of topology learning models - a low accuracy on a less "reversable" dataset might be the best any model can do, even though it is not all that well. Some theoretical proofs of performance and how much topology can be saved for sure in these methods would also be interesting to see.

Some other things missing from the first paper are how exactly the GNG model gets locally different dimensions dependent on the data. This is a very interesting concept but it is not extended in the second paper at all. The emperical evaluation of the GNG in the original paper is also small, it was done only on 2 simple datasets and thus many of the promised properties were not shown emperically.

Both papers would benefit from including more background and discusion on parameters and their tunings, their effects on speed and accuracy.

The papers that were chosen do not seem to be the final result of working on these types of problems and models, but only the introductions. More follow ups exist with further extensions, comparisons and testings. This highlights that research is often a work in progress and there is always room for improvements.

## Annotated Bibliography

1. Fritzke, Bernd (January 1994). "A Growing Neural Gas Model Learns Topologies" (PDF). Proceedings of the 7th International Conference on Neural Information Processing Systems: 625–632.
2. Deng, Da; Kasabov, Nikola (April 2003). "On-line pattern analysis by evolving self-organizing maps". Neurocomputing. 51: 87–103.
3. Kohonen, T. "Self-organized formation of topologically correct feature maps". Biological Cybernetics. 43: 59–69.
4. Fritzke, B (1994). "Growing cell structures - a self-organizing network for unsupervised and supervised learning". Neural Networks. 7 (9): 1441–1460.