# Course:CPSC 522/Progressive Neural Network

## Title

Progressive Neural Networks

## Abstract

The field of continual learning has presented various methods for an artificial neural network to effectively learn multiple tasks sequentially. Specifically, this paper will discuss Progressive Neural Networks, a method that achieves continual learning by adding a new network to the original artificial neural network when new tasks are presented.

### Builds on

PNNs apply a continual learning approach to Neural Networks. PNNs are intended for Deep Reinforcement Learning, a subfield of Reinforcement Learning.

### Related Pages

One of the main goals of PNNs is to transfer information between systems. Transfer Learning with Markov Logic is another method to achieve the same goal.

## Content

### Introduction

Current artificial neural networks (ANNs) have been extremely successful at outperforming humans on single tasks such as image classification, Go, or chess. However, when trained on multiple tasks sequentially, most ANNs fail to learn a new task without losing performance on the original task[1]. This problem is referred to as the “stability-plasticity” dilemma, where plasticity refers to the ability to learn new tasks and stability refers to the ability to keep performing previous tasks. In response, many papers have explored methods to achieve continual learning, a learning setting in which an ANN learns new tasks without becoming inadequate at previous tasks. This paper provides a short overview of continual learning and the different approaches used, then covers the progressive neural networks method in more detail.

### Background

Continual learning can be defined as learning from an 'infinite stream of data' whose domains and associated tasks change over time. In this general setting, at each time step ${\displaystyle t}$, a system (e.g. a neural network) receives a new set of samples ${\displaystyle \{x_{t},y_{t}\}}$ from a different task and has to learn a function ${\displaystyle f}$ (with parameters ${\displaystyle \theta }$) that minimizes the loss ${\displaystyle l}$ on this task without becoming inadequate on the previous tasks[2]. This can be represented mathematically as follows[2]:

${\displaystyle \theta ^{t}=\operatorname {argmin} _{\theta }\;l(f(x_{t};\theta ),y_{t})}$

${\displaystyle \operatorname {s.t.} \;l(f(x_{i};\theta ),y_{i})\leq l(f(x_{i};\theta ^{t-1}),y_{i})\quad \forall i\in [0..t-1]}$

The constraint states that the new parameters must not increase the loss on any previous task. The objective can be relaxed to allow some performance degradation on previous tasks; however, this relaxation does not apply to progressive neural networks.
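The constraint above can be checked directly for a candidate set of parameters. The following is a minimal sketch, assuming a hypothetical scalar linear model and a squared loss; the names `predict`, `mean_loss`, and `satisfies_constraint` are illustrative and do not come from [2].

```python
def squared_loss(prediction, target):
    return (prediction - target) ** 2


def predict(x, theta):
    """Hypothetical model f(x; theta) = theta[0] * x + theta[1]."""
    return theta[0] * x + theta[1]


def mean_loss(samples, theta):
    """Average loss l(f(x; theta), y) over one task's (x, y) samples."""
    return sum(squared_loss(predict(x, theta), y) for x, y in samples) / len(samples)


def satisfies_constraint(theta_new, theta_prev, previous_tasks, tol=1e-12):
    """True if theta_new is no worse than theta_prev on every previous task."""
    return all(
        mean_loss(samples, theta_new) <= mean_loss(samples, theta_prev) + tol
        for samples in previous_tasks
    )


previous_tasks = [[(0.0, 1.0), (1.0, 3.0)]]  # task 0 follows y = 2x + 1
theta_prev = (2.0, 1.0)                      # fits task 0 exactly
theta_forgetful = (0.0, 0.0)                 # ignores task 0

print(satisfies_constraint(theta_prev, theta_prev, previous_tasks))       # True
print(satisfies_constraint(theta_forgetful, theta_prev, previous_tasks))  # False
```

In this toy setting, keeping the old parameters trivially satisfies the constraint, while parameters that ignore task 0 raise its mean loss from 0 to 5 and violate it.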

There are many methods used in literature to achieve continual learning; these methods can be grouped under three different categories[1]:

• Replay Methods
• Regularization Methods
• Parameter Isolation Methods

This page will focus on parameter isolation methods, the approach used by progressive neural networks. Nonetheless, the other methods are discussed briefly here. In replay methods, real or generated data from previous tasks are replayed periodically while learning a new task to reduce 'forgetting' of previous tasks[1]. However, as only a small subset of the data is replayed, replay methods may overfit to this subset[1]. Moreover, replay methods require storage of data, which might not always be possible due to privacy or other concerns. In circumstances where data storage and replay are not available, regularization methods can account for past tasks by introducing a regularization term in the loss function[1]. This term allows for the retention of past knowledge while learning new tasks.
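To make the idea of a regularization term concrete, the sketch below adds a penalty for moving away from the parameters learned on previous tasks. The uniform quadratic penalty and the `strength` parameter are assumptions chosen for illustration; they are not a specific method taken from [1].

```python
def regularized_loss(task_loss, theta, theta_prev, strength=1.0):
    """New-task loss plus a quadratic penalty on drift from theta_prev.

    Assumption: every parameter is penalized equally.
    """
    penalty = sum((p - q) ** 2 for p, q in zip(theta, theta_prev))
    return task_loss + 0.5 * strength * penalty


theta_prev = [1.0, -2.0]

# No drift from the old parameters: only the new-task loss remains.
print(regularized_loss(0.25, [1.0, -2.0], theta_prev))  # 0.25

# Moving one parameter by 1.0 adds 0.5 * strength * 1.0 to the loss.
print(regularized_loss(0.25, [2.0, -2.0], theta_prev))  # 0.75
```

A larger `strength` favors stability (retaining old tasks), while a smaller one favors plasticity (fitting the new task).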

Lastly, parameter isolation methods aim to segregate the systems or subsystems (for instance, neural networks or neural network layers) responsible for each task, thereby eliminating the possibility of forgetting previously learned tasks. Parameter isolation methods can be split into two approaches: fixed architectures and dynamic architectures[1]. In fixed architecture approaches, the subsystems responsible for a task are identified and masked when training the system on a new task. Meanwhile, dynamic architecture approaches 'freeze' the system's parameters and add new branches to the system, or a new replicate system, to learn the new task[1]. Dynamic architectures that add a new system for each new task have to transfer knowledge effectively and efficiently between all systems to improve convergence time.

### Progressive Neural Networks

Progressive Neural Networks (PNNs) are a continual learning method that uses a dynamic architecture approach on neural networks. In PNNs, a neural network is trained on a task until convergence; the parameters of that network are then frozen, and a new randomly initialized neural network is linked to the previous network(s) through lateral connections[3]. The new network is then trained on the new task until convergence. The lateral connections enable transfer learning and improve convergence speed[3]. An example architecture can be seen in the figure below.
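The freeze-then-extend procedure can be sketched as simple bookkeeping. The `Column` and `ProgressiveNet` classes below are hypothetical names used only for illustration; the sketch tracks which networks are frozen and how many lateral connection sets each one receives, and it does not perform any training.

```python
class Column:
    """One neural network in the PNN (bookkeeping only, no weights)."""

    def __init__(self, task_name, num_previous):
        self.task_name = task_name
        self.num_laterals = num_previous  # one lateral set per earlier network
        self.frozen = False


class ProgressiveNet:
    def __init__(self):
        self.columns = []

    def add_task(self, task_name):
        # Freeze every existing network before adding a new one.
        for column in self.columns:
            column.frozen = True
        new_column = Column(task_name, num_previous=len(self.columns))
        self.columns.append(new_column)
        return new_column

    def trainable_tasks(self):
        return [c.task_name for c in self.columns if not c.frozen]


net = ProgressiveNet()
net.add_task("task A")  # train on task A until convergence (not shown)
net.add_task("task B")  # task A's network is now frozen

print(net.trainable_tasks())        # ['task B']
print(net.columns[1].num_laterals)  # 1
```

Only the most recently added network is trainable at any time, which is why earlier tasks cannot be forgotten.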

Fig1. An example architecture of a PNN learning two tasks.

#### Definitions and Notations

PNNs are explained in detail in this section with the help of mathematical notation. Please note that the notation follows [3]. Each neural network consists of ${\displaystyle L}$ layers: an input layer ${\displaystyle l_{0}}$, hidden layers ${\displaystyle l_{i}}$, and the output layer. Each layer contains ${\displaystyle n_{i}}$ units and a weight matrix ${\displaystyle W_{i}^{k}}$, where ${\displaystyle k}$ is the index of the neural network. When a new neural network is added, the outputs from layers ${\displaystyle l_{i-1}}$ of the previous networks are given as inputs to layers ${\displaystyle l_{i}}$ of the new network (the lateral connections in Fig1). These lateral connections are represented by ${\displaystyle U_{i}^{k:j}}$, which connects network ${\displaystyle j}$ to network ${\displaystyle k}$. Thus, the activations of a layer in a new neural network can be generally represented as follows[3]:

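${\displaystyle h_{i}^{k}=f\left(W_{i}^{k}h_{i-1}^{k}+\sum _{j<k}U_{i}^{k:j}h_{i-1}^{j}\right)}$

where ${\displaystyle h_{i}^{k}}$ denotes the activations of layer ${\displaystyle l_{i}}$ in network ${\displaystyle k}$, and ${\displaystyle f(x)=\operatorname {max} (0,x)}$ introduces element-wise non-linearity.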
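As a rough illustration of this layer computation, the sketch below evaluates one layer of a new network that receives lateral input from a single frozen network. It uses plain Python lists instead of a tensor library, and the weights and activations are made-up numbers; the function names are illustrative rather than taken from [3].

```python
def relu(vector):
    """Element-wise f(x) = max(0, x)."""
    return [max(0.0, v) for v in vector]


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def pnn_layer(W, h_own, laterals, h_prev_networks):
    """Compute relu(W h_own + sum over earlier networks j of U_j h_j).

    W:               weight matrix of this network at layer i
    h_own:           this network's activations at layer i-1
    laterals:        one lateral matrix U per earlier network
    h_prev_networks: layer i-1 activations of each earlier network
    """
    pre_activation = matvec(W, h_own)
    for U, h_j in zip(laterals, h_prev_networks):
        lateral_input = matvec(U, h_j)
        pre_activation = [a + b for a, b in zip(pre_activation, lateral_input)]
    return relu(pre_activation)


# Second network (k = 2): 2 units at layer i, 2 units at layer i-1.
W = [[1.0, -1.0], [0.5, 0.5]]
h_own = [2.0, 1.0]

# One earlier, frozen network (j = 1) with 3 units at layer i-1.
U = [[1.0, 0.0, 0.0], [0.0, -1.0, -1.0]]
h_prev = [0.5, 1.0, 2.0]

print(pnn_layer(W, h_own, [U], [h_prev]))  # [1.5, 0.0]
print(pnn_layer(W, h_own, [], []))         # [1.0, 1.5] (first network, no laterals)
```

With no earlier networks the sum over ${\displaystyle j<k}$ is empty, so the layer reduces to an ordinary feed-forward layer.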

#### PNNs in Reinforcement Learning

PNNs can be used with a variety of neural network architectures, such as convolutional neural networks, recurrent neural networks, and deep reinforcement learning networks. In deep reinforcement learning networks (such as deep Q-networks and A3C), each neural network is trained to solve a particular Markov decision process (MDP)[3]. The ${\displaystyle k}$-th network defines a policy ${\displaystyle \pi ^{k}(a|s)}$, where ${\displaystyle a}$ is the action taken given state ${\displaystyle s}$; this policy is sampled at each time step to select an action, which yields the subsequent state.
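Sampling an action from a policy can be sketched as follows. The example assumes that the network's output layer produces one unnormalized score (logit) per action and that these are turned into probabilities with a softmax; both are illustrative assumptions, and the logit values are made up.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    largest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Draw one action index a from the distribution pi(a | s)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)      # seeded for repeatability
logits = [2.0, 0.5, -1.0]   # hypothetical network outputs for state s
probs = softmax(logits)
action = sample_action(logits, rng)

print(probs)
print(action)
```

Actions with higher scores are sampled more often, but every action keeps a non-zero probability.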

#### Limitations

PNNs avoid forgetting by construction and perform well on all tasks learned; however, they have three main limitations. First, when a new neural network is added, it is connected laterally to all previous networks, so the number of parameters grows quadratically with the number of tasks. Second, analysis of PNNs has shown that new networks become increasingly under-utilized as the number of networks increases[3]. This suggests that new neural networks (${\displaystyle k=2,3,4,...}$) can have fewer layers or units as ${\displaystyle k}$ increases. Lastly, the task label must be known when using PNNs in order to select the appropriate neural network's output.
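The quadratic growth in the first limitation can be illustrated with a small count. The sketch assumes, for simplicity, that every network has the same number of units in each layer and that a layer receives lateral connections from every earlier network; the width of 64 units is an arbitrary choice.

```python
def weight_blocks_per_layer(num_networks):
    """Count weight matrices at one layer across all networks.

    Network k has its own matrix W plus one lateral matrix U
    for each of the k - 1 earlier networks.
    """
    return sum(1 + (k - 1) for k in range(1, num_networks + 1))


def params_per_layer(num_networks, units):
    """Parameter count at one layer, assuming square units x units matrices."""
    return weight_blocks_per_layer(num_networks) * units * units


for num_networks in (1, 2, 4, 8):
    blocks = weight_blocks_per_layer(num_networks)
    params = params_per_layer(num_networks, units=64)
    print(num_networks, blocks, params)
```

Under these assumptions the number of weight matrices per layer is ${\displaystyle K(K+1)/2}$ for ${\displaystyle K}$ networks, giving 1, 3, 10, and 36 matrices for 1, 2, 4, and 8 tasks.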

## Annotated Bibliography

[1] M. De Lange et al., “Continual learning: A comparative study on how to defy forgetting in classification tasks,” arXiv:1909.08383 [cs, stat], Sep. 2019.

[2] R. Aljundi, “Continual Learning in Neural Networks,” arXiv:1910.02718 [cs, stat], Oct. 2019.

[3] A. A. Rusu et al., “Progressive Neural Networks,” arXiv:1606.04671 [cs], Sep. 2016.

[4] D. T. Tran and A. Iosifidis, “Learning to Rank: A Progressive Neural Network Learning Approach,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 8355–8359, doi: 10.1109/ICASSP.2019.8683711.