## Recurrent Neural Networks

A Recurrent Neural Network (RNN) is a type of Artificial Neural Network that can recognize complex patterns in sequential input data such as text, handwriting, music, speech, and time series.

Principal Author: Kevin Dsouza
Collaborators:

## Abstract

This page introduces the salient characteristics of Recurrent neural networks (RNNs) and what differentiates them from regular feedforward neural networks. The architecture and the mathematical formulations of RNNs are discussed in detail and examples of their working are shown. Finally, their limitations are analyzed and further improvements in architecture are discussed.

### Builds on

Recurrent neural networks are special types of Artificial Neural Networks that can detect correlations in temporal data.

### Related Pages

Recurrent networks depart from the regular function approximators like feedforward neural networks and networks that analyze multidimensional tensors like convolutional neural networks.

## Content

### Introduction

Our world is made of data streams that are highly correlated in time, for example, musical notes or words in a text. In a song, a particular note is not entirely independent of the notes around it, and such dependencies recur throughout the song. In a text, a certain type of phrase is more likely to be followed by another specific type of phrase, exposing the underlying structure of language. If we want to design a system that is capable of understanding music or text, then we need a network that can model these temporal correlations as well.

Regular feedforward networks take in input data and compute a mathematical transformation on it to produce outputs. The outputs are category labels in a classification task or a continuous-valued variable in a regression task. The network internalizes the information present in the data by assigning values to its weights. This is carried out by defining a loss function and minimizing it with respect to the weights using a technique like gradient descent. Say, for example, that we are supposed to take raw patient records and predict the type of tumor present. For each sample of patient data, the output should be either "malignant" or "benign". In this case, one sample has no effect on the prediction for the next sample because the samples are independent of each other. This makes feedforward networks forgetful in nature. They cannot remember anything about their recent past; they can only internalize the information in the overall data space.

### Recurrent networks

Simple recurrent networks of the kind described here were introduced by Jeffrey Elman in 1990 [1]. This was a revelation in cognitive science and linguistics because it departed from treating phonemes and words as given units, proposing instead that these units emerge from the structure of the data stream itself. An RNN takes two sources of input: its current example and what it has perceived previously in time. Therefore, a decision taken by an RNN at time ${\displaystyle t-1}$ affects the decision it is going to take at time ${\displaystyle t}$. These two sources of input, the present input and the experience from the recent past, combine to determine the response to each new data sample.

Recurrent networks feed their past outputs back in as inputs, forming a feedback loop. They can therefore be thought of as having a memory of their past states, which lets them perform tasks that feedforward networks cannot on data whose information is embedded in the sequence itself. This information about the past is stored in what is called the hidden state of the network. The hidden state preserves information that might affect samples arriving later in time, thus capturing "long-term dependencies". Because the same weights are applied at every time step, RNNs can be thought of as networks that share weights over time. The process of maintaining this memory over time can be mathematically represented as

${\displaystyle h_{t}=f(Wx_{t}+Uh_{t-1})}$

The hidden state at time ${\displaystyle t}$ is denoted by ${\displaystyle h_{t}}$. It is a function of the present input ${\displaystyle x_{t}}$, weighted by a weight matrix ${\displaystyle W}$, and of the hidden state at the previous time instant ${\displaystyle h_{t-1}}$, weighted by its own hidden-state-to-hidden-state matrix ${\displaystyle U}$. This matrix is also called a transition matrix, similar to the one used in Markov chains. The two weight matrices assign relative importance to the present input and the previous state of the network.

Figure 1: Working of a recurrent neural network (source: [2])

The sum of these two weighted terms is passed through a function ${\displaystyle f}$, which squashes very large and very small values into a bounded range. This function is typically a sigmoid, with outputs in ${\displaystyle (0,1)}$, or a tanh, with outputs in ${\displaystyle (-1,1)}$. As long as the memory can persist, the hidden state maintains traces not only of the previous state but also of the states that preceded it. An animation of the operation of the RNN over time is shown in Figure 1 [2]. The first line of nodes can be thought of as a regular feedforward network which unrolls as time progresses. In the diagram, ${\displaystyle x}$ are the sequential inputs, ${\displaystyle w}$ are the weights, ${\displaystyle f}$ is the activation function and ${\displaystyle b}$ is the output of the activation function after it has been transformed.
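
The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `rnn_forward`, the layer sizes, and the random initialization are assumptions made for the example, and tanh is used for ${\displaystyle f}$.

```python
import numpy as np

def rnn_forward(xs, W, U, h0=None, f=np.tanh):
    """Return the hidden states h_1..h_T for inputs xs of shape (T, input_dim)."""
    h = np.zeros(U.shape[0]) if h0 is None else h0
    states = []
    for x_t in xs:
        # The same W and U are used at every step: the weights are shared over time.
        h = f(W @ x_t + U @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 4, 5          # illustrative sizes
W = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
U = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
xs = rng.normal(size=(T, input_dim))

hs = rnn_forward(xs, W, U)
print(hs.shape)   # (5, 4): one hidden state per time step
```

Each row of `hs` depends on the current input and, through the previous row, on every earlier input in the sequence.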

### Backpropagation for recurrent networks

#### 1. Backpropagation Through Time (BPTT)

Backpropagation in recurrent networks is similar to feedforward networks with the additional computation of the gradient being passed down across different time steps. Neural networks are essentially a composition of functions like ${\displaystyle f(g(x))}$ and adding an additional time component will only extend the series of functions for which the chain rule of derivatives needs to be applied.

Consider a length ${\displaystyle n}$ training sequence. After defining the loss function at the output, the total loss would be the average of the losses at all the time steps. This loss is backpropagated through the network across each time step, starting from the last time step and the gradients are accumulated across all these time steps. As the weights are shared by all the time steps, after propagating through the unfolded network, the accumulated gradients are used to update the shared weights.
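
Written out for the hidden-to-hidden matrix ${\displaystyle U}$, this accumulation takes the following form. The notation here is an assumption that follows the usual textbook presentation rather than anything fixed by this page: ${\displaystyle L_{t}}$ is the loss at time step ${\displaystyle t}$, ${\displaystyle a_{i}=Wx_{i}+Uh_{i-1}}$ is the pre-activation, and ${\displaystyle \partial ^{+}h_{j}/\partial U}$ is the immediate derivative that treats ${\displaystyle h_{j-1}}$ as a constant.

${\displaystyle {\frac {\partial L}{\partial U}}={\frac {1}{n}}\sum _{t=1}^{n}\sum _{j=1}^{t}{\frac {\partial L_{t}}{\partial h_{t}}}\left(\prod _{i=j+1}^{t}{\frac {\partial h_{i}}{\partial h_{i-1}}}\right){\frac {\partial ^{+}h_{j}}{\partial U}},\qquad {\frac {\partial h_{i}}{\partial h_{i-1}}}=\operatorname {diag} \left(f'(a_{i})\right)U}$

The factors of the product are ordered from ${\displaystyle i=t}$ down to ${\displaystyle i=j+1}$. The contribution of step ${\displaystyle j}$ to the loss at step ${\displaystyle t}$ therefore passes through ${\displaystyle t-j}$ Jacobians, one for each time step in between.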

#### 2. Truncated BPTT

An approximation of full BPTT, called truncated BPTT, is preferred for long sequences. Propagating the gradients through the entire length of the sequence is too costly when sequences are long, so they are propagated back only a specified ${\displaystyle k}$ time steps, with ${\displaystyle k<n}$. The downside of this is that the network will not be able to learn dependencies which go as far back as the length of the sequence itself.

Pseudocode for the truncated BPTT is given below [3], where the training sequence is of length ${\displaystyle n}$ but the network is unfolded for ${\displaystyle k}$ time steps:

    Back_Propagation_Through_Time(x, b)   // x[t] is the input at time t. b[t] is the output
        Unfold the network to contain k instances of f
        do until stopping criteria is met:
            h = the zero-magnitude vector; // h is the current context
            for t from 0 to n - k          // t is time. n is the length of the training sequence
                Set the network inputs to h, x[t], x[t+1], ..., x[t+k-1]
                p = forward-propagate the inputs over the whole unfolded network
                e = b[t+k] - p;            // error = target - prediction
                Back-propagate the error, e, back across the whole unfolded network
                Sum the weight changes in the k instances of f together.
                Update all the weights in f.
                h = f(h, x[t]);            // compute the context for the next time-step
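
The pseudocode above can be turned into a small runnable sketch. Several details are assumptions made for illustration and are not specified by the pseudocode: the network uses a tanh hidden layer with a linear readout matrix `V`, the loss is the squared error, the weights are updated by plain gradient descent with learning rate `lr`, and the loop stops at `t = n - k - 1` so that `b[t + k]` stays inside a zero-indexed sequence. The function name `truncated_bptt_epoch` is likewise hypothetical.

```python
import numpy as np

def truncated_bptt_epoch(x, b, W, U, V, k, lr=0.02):
    """One pass over a length-n sequence, unfolding the network for k steps.

    x: (n, input_dim) inputs, b: (n, output_dim) targets.
    W, U, V are updated in place; the mean squared-error loss is returned.
    """
    n = len(x)
    h = np.zeros(U.shape[0])                 # current context
    total_loss = 0.0
    for t in range(n - k):                   # keeps b[t + k] inside the sequence
        # Forward-propagate h, x[t], ..., x[t+k-1] through the unfolded network.
        hs = [h]
        for s in range(k):
            hs.append(np.tanh(W @ x[t + s] + U @ hs[-1]))
        p = V @ hs[-1]                       # prediction (linear readout, an assumption)
        e = b[t + k] - p                     # error = target - prediction
        total_loss += 0.5 * float(e @ e)

        # Back-propagate e across the k instances, summing the weight changes.
        dV = -np.outer(e, hs[-1])
        dh = -(V.T @ e)
        dW, dU = np.zeros_like(W), np.zeros_like(U)
        for s in reversed(range(k)):
            da = dh * (1.0 - hs[s + 1] ** 2)   # tanh'(a) = 1 - tanh(a)^2
            dW += np.outer(da, x[t + s])
            dU += np.outer(da, hs[s])
            dh = U.T @ da

        # Update the shared weights, then advance the context by one step.
        W -= lr * dW
        U -= lr * dU
        V -= lr * dV
        h = np.tanh(W @ x[t] + U @ h)
    return total_loss / (n - k)

# Toy usage: predict the next value of a sine wave from the previous k values.
rng = np.random.default_rng(0)
n, k, hidden = 60, 4, 8
series = np.sin(0.3 * np.arange(n)).reshape(n, 1)
W = rng.normal(scale=0.3, size=(hidden, 1))
U = rng.normal(scale=0.3, size=(hidden, hidden))
V = rng.normal(scale=0.3, size=(1, hidden))
losses = [truncated_bptt_epoch(series, series, W, U, V, k) for _ in range(20)]
print(losses[0], losses[-1])
```

Because the gradient is only followed back through `k` instances, any dependency longer than `k` steps can reach the weights only indirectly, through the carried context `h`.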


### Applications of RNNs

Recurrent networks can do very exciting things with sequences of data and are more appealing if we want to build intelligent systems that need to have the memory of the temporal sequence of events.

#### 1. Character-Level Language Models

RNNs can be used to train character-level language models. Given a large piece of text, an RNN can learn to model the probability distribution of the next character given the sequence of characters before it. This is very useful in creating generative models.

Figure 2: RNN with confidences of the next character as output (source: [4])

As a working example [4], let us assume we only had a vocabulary of four letters “helo”, and wanted to train an RNN on the training sequence “hello”. This training sequence can be broken down into 4 separate training examples:

1. The probability of “e” should be likely given the context of “h”

2. “l” should be likely in the context of “he”

3. “l” should also be likely given the context of “hel”

4. “o” should be likely given the context of “hell”.

We therefore encode each character as a vector using one-hot encoding (i.e. all zeros except for a one at the index of the character in the vocabulary) and feed the vectors into the RNN one at a time. We then obtain a sequence of 4-dimensional output vectors (one dimension per vocabulary entry), each holding the confidences the RNN assigns to every character coming next in the sequence. This is illustrated in Figure 2 [4]. Each output vector can then be passed to a softmax function, which gives a probability distribution over the vocabulary; from it we can either pick the most probable character or sample one to obtain the next character.
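
The encoding step and the four training examples can be written out directly. This is a small sketch; the helper name `one_hot` and the variable names are chosen for the example only.

```python
import numpy as np

vocab = "helo"
char_to_index = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """All zeros except for a one at the index of ch in the vocabulary."""
    v = np.zeros(len(vocab))
    v[char_to_index[ch]] = 1.0
    return v

text = "hello"
# Each input character is paired with the character that follows it.
pairs = list(zip(text[:-1], text[1:]))
inputs = np.stack([one_hot(ch) for ch, _ in pairs])
targets = [char_to_index[nxt] for _, nxt in pairs]

print(pairs)         # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
print(inputs[0])     # [1. 0. 0. 0.]
```

The RNN sees one row of `inputs` per time step; the context "he", "hel", "hell" is not passed explicitly but carried in the hidden state.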

Figure 3: Encoding of a sentence by RNN (source: [5])

For example, in the first time step, when the RNN saw the character “h”, it assigned a confidence of 1.0 to the next letter being “h”, 2.2 to the letter “e”, -3.0 to “l”, and 4.1 to “o”. Because the correct next character in our training string “hello” is “e”, we would like to increase its confidence (green) and decrease the confidence of all other letters (red). Similarly, we have a target character at every one of the 4 time steps that we’d like the model to output. Since the RNN operations are differentiable, the backpropagation algorithm can be used to figure out in what direction the weights need to be adjusted to increase the scores of the desired targets (green bold numbers). A parameter update is then performed, which nudges every weight slightly in that direction. If we were to feed the same inputs to the RNN after the parameter update, we would find that the scores of the desired targets would be higher, and the scores of the other characters would be lower. This process is repeated until the network converges, that is, until the correct characters are consistently predicted next.
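
The direction of this adjustment can be checked numerically for the first time step. The sketch below assumes a cross-entropy loss on top of the softmax, which the text does not name explicitly; under that assumption the gradient with respect to the scores is the probability vector minus the one-hot target.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

vocab = "helo"
scores = np.array([1.0, 2.2, -3.0, 4.1])   # confidences for "h", "e", "l", "o" from the text
target = vocab.index("e")

probs = softmax(scores)
# Gradient of the cross-entropy loss -log(probs[target]) with respect to the scores.
grad = probs.copy()
grad[target] -= 1.0

# A gradient-descent step moves each score against its gradient.
for ch, p, g in zip(vocab, probs, grad):
    print(f"{ch}: prob={p:.3f}  score should go {'up' if g < 0 else 'down'}")
```

Only the entry for “e” has a negative gradient, so a descent step raises its score and lowers the other three, with “o” (currently the most probable character) pushed down the hardest.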

#### 2. Word translation

RNNs can be used to encode sentences and later translate these encodings into words in other languages. Because a network operates on numbers rather than on raw text, we convert the sentence into a fixed-length vector encoding. To generate this encoding, we feed the sentence into the RNN one word at a time. The hidden state after the last word has been processed is the set of values that represents the entire sentence. This is illustrated by the animation in Figure 3.

We now have a way to represent an entire sentence as a single vector! We don’t know what each number in the encoding means, but it doesn’t really matter. As long as each sentence is identified by its own set of numbers, we don’t need to know exactly how those numbers were generated. We can take two RNNs and hook them up end-to-end. The first RNN generates the encoding that represents a sentence, and the second RNN takes that encoding and translates the sentence into Spanish (or any other language), as shown in Figure 4.
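
The end-to-end arrangement can be sketched as two small RNNs. Everything concrete here is hypothetical: the toy vocabularies, the embedding tables, the layer sizes, the `<start>` and `<end>` tokens, and the greedy decoding loop are assumptions for illustration, and the weights are random and untrained, so the decoded words are meaningless. The point is only the data flow: a variable-length sentence goes in, a fixed-length vector comes out, and the decoder starts from that vector.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 5, 6
src_vocab = ["i", "like", "green", "apples"]          # hypothetical toy vocabularies
tgt_vocab = ["<start>", "<end>", "me", "gustan", "las", "manzanas", "verdes"]

def init(*shape):
    return rng.normal(scale=0.5, size=shape)          # untrained random parameters

E_src, W_enc, U_enc = init(len(src_vocab), embed_dim), init(hidden_dim, embed_dim), init(hidden_dim, hidden_dim)
E_tgt, W_dec, U_dec = init(len(tgt_vocab), embed_dim), init(hidden_dim, embed_dim), init(hidden_dim, hidden_dim)
V_dec = init(len(tgt_vocab), hidden_dim)

def encode(words):
    """Feed the sentence in one word at a time; the final hidden state is the encoding."""
    h = np.zeros(hidden_dim)
    for w in words:
        h = np.tanh(W_enc @ E_src[src_vocab.index(w)] + U_enc @ h)
    return h

def decode(encoding, max_len=10):
    """Start the second RNN from the encoding and greedily emit one word per step."""
    h, prev, out = encoding, tgt_vocab.index("<start>"), []
    for _ in range(max_len):
        h = np.tanh(W_dec @ E_tgt[prev] + U_dec @ h)
        prev = int(np.argmax(V_dec @ h))
        if tgt_vocab[prev] == "<end>":
            break
        out.append(tgt_vocab[prev])
    return out

short = encode(["i", "like"])
long = encode(["i", "like", "green", "apples"])
print(short.shape, long.shape)    # (6,) (6,)
print(decode(long))               # arbitrary words, since nothing has been trained
```

In a trained system, both networks would be optimized jointly so that the decoder's output matches reference translations.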

Figure 4: Translating English to Spanish (source: [5])

### Problems in recurrent networks

#### 1. Vanishing and exploding gradients

Figure 5: Derivatives of activation functions (source: [6])

The gradients measure the change in the loss relative to a change in the weights of the network. To account for the effect of events that took place many time steps before the output, the gradients have to be passed back a long way, and they can diminish or explode in the process. This is primarily because of the multiplicative nature of the recurrent neural network. As we unfold the network in time, the gradient is multiplied at every step by a factor that depends on the recurrent weights and on the derivative of the activation function; depending on whether these factors are consistently smaller or larger than one, the gradient either vanishes or explodes. One can think of this as similar to compound interest: a quantity repeatedly multiplied by a number greater than one soon becomes very large, and one repeatedly multiplied by a number less than one becomes very small.
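
The compounding effect is easy to reproduce. The sketch below is a deliberate simplification: instead of the true per-step factor of an RNN, it multiplies a unit-norm gradient vector by a scaled orthogonal matrix, so the norm changes by exactly the scale factor at every step. The matrix `Q` and the scales 0.9 and 1.1 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # orthogonal matrix: preserves vector norms
g = rng.normal(size=4)
g /= np.linalg.norm(g)                          # start from a unit-norm gradient

def norm_after(scale, steps):
    """Norm of the gradient after being passed back through `steps` time steps."""
    v = g.copy()
    for _ in range(steps):
        v = (scale * Q).T @ v                   # one step back in time
    return np.linalg.norm(v)

shrunk = norm_after(0.9, 50)    # equals 0.9**50, about 0.005
grown = norm_after(1.1, 50)     # equals 1.1**50, about 117
print(shrunk, grown)
```

After only 50 steps, the two gradients differ by more than four orders of magnitude even though the per-step factors differ by just 0.2.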

Figure 6: Continuous multiplication of sigmoids (source: [2])

A main cause is the derivative of the activation function, which enters this product once per time step. The sigmoid, given by

${\displaystyle S(x)={\frac {1}{1+e^{-x}}}}$

has a derivative that lies in the range ${\displaystyle (0,0.25]}$, with the maximum of 0.25 reached at ${\displaystyle x=0}$. This is problematic because repeated multiplication by sigmoid derivatives makes the gradients vanish. This is why the tanh and the ReLU activation functions are sometimes preferred: their derivatives are larger. Figure 5 compares the derivatives of the three activation functions. It can be seen that the ReLU is a much better choice in this respect, as its derivative is exactly one for all input values greater than zero. Figure 6 shows how the continuous multiplication of sigmoids can squash the curve and flatten it out until there is no detectable slope, which is analogous to the vanishing of gradient values over time.
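
These derivative values can be verified directly. The following is a short sketch using the closed-form derivatives of the three activation functions; the choice of 20 time steps in the last line is arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def d_relu(x):
    return (x > 0).astype(float)

print(d_sigmoid(0.0))                       # 0.25, the largest value the derivative takes
print(d_tanh(0.0))                          # 1.0
print(d_relu(np.array([-1.0, 2.0])))        # [0. 1.]

# Best case for the sigmoid: 20 steps, each contributing its largest derivative.
print(0.25 ** 20)                           # roughly 9.1e-13
```

Even in the most favourable case, the sigmoid alone shrinks the gradient by a factor of four per time step, whereas tanh and ReLU can pass it through unchanged.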

Exploding gradients are also an issue if the weights are not initialized well. Generally, the weights are drawn from a Gaussian distribution with mean zero and standard deviation one, which usually does not result in exploding gradients, but a poor initialization can let the weights reach large values and make the gradient values explode. Both vanishing and exploding gradients are highly undesirable during training.

#### 2. Problems with BPTT

Like backpropagation in feedforward networks, BPTT has a tendency to get stuck in local minima, and the problem is much worse than in the regular feedforward case [7]. The recurrent feedback loops in these networks tend to create chaotic responses in the error surface, which cause local minima to occur much more frequently and in bad locations on the error surface.

### Solutions

#### 1. Choose better activation functions

In order to mitigate the vanishing gradient problem to some extent, we can choose activation functions whose derivatives are not very small. Two such choices are tanh and ReLU.

#### 2. Gradient clipping

A simple solution for exploding gradients is to scale them down whenever their norm goes beyond a certain threshold [8]. This is an acceptable thing to do during stochastic gradient descent and gives satisfactory results.
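
A minimal sketch of this rescaling is given below. It clips by the joint L2 norm of all gradient arrays, which is one common variant and an assumption here; the function name `clip_by_global_norm` and the threshold of 1.0 are chosen for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays so their joint L2 norm is at most threshold."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm <= threshold:
        return grads, total_norm            # small gradients pass through unchanged
    scale = threshold / total_norm
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]    # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, threshold=1.0)
print(norm_before)    # 13.0
print(clipped[0])     # the direction is kept, only the length shrinks
```

Scaling all arrays by the same factor preserves the direction of the update, so the step still points downhill but can no longer be arbitrarily long.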

#### 3. Long Short Term Memory (LSTM)

The vanishing gradient problem is more difficult to solve and is also an important one, because it prevents us from modelling long-term dependencies in our data. This is problematic for long sequences such as time series data and long sentences. To address it, certain architectural changes are made: a unit called the Long Short Term Memory (LSTM) is used in place of the simple recurrent unit of the Simple Recurrent Network (SRN). LSTMs are very effective at handling sequential data and have given state-of-the-art results on language and speech related tasks. These will be discussed in detail, along with an implementation using TensorFlow, in the coming chapters.

## Annotated Bibliography

1. Elman, Jeffrey L., "Finding structure in time", Cognitive Science 14.2 (1990): 179-211.
2. M.P. Cuéllar, M. Delgado and M.C. Pegalajar (2006), "An Application of Non-linear Programming to Train Recurrent Neural Networks in Time Series Prediction Problems", Enterprise Information Systems VII. Springer Netherlands: 95–102.
3. Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio, "On the difficulty of training recurrent neural networks", International Conference on Machine Learning. 2013.