Course:CPSC522/Regularization for Neural Networks

Hypothesis

Does Dropout^[1] or Batch Normalization^[2] actually perform better than L2 regularization?

Principal Author: Tian Qi (Ricky) Chen

Abstract

Due to the complexity of neural networks, overfitting is extremely common. With the increasingly large amounts of data available, noisy data is extremely prevalent and feature engineering is more applicable than ever. Deep neural networks are hailed as the holy grail of function approximators, capable of learning complex functions with an inherent ability to engineer useful features. However, a deep network's hidden layers are computed so as to minimize the training loss, and excessively complex networks overfit very easily and lose the ability to generalize. Thus regularization methods are necessary to prevent overfitting. While Dropout^[1] has recently become popular as a regularization method, it is not immediately clear why it works or how it achieves different results from other methods such as L2 regularization. Meanwhile, Batch Normalization^[2] aims to make learning algorithms more robust to hyperparameters such as the learning rate and to accelerate learning, while also making the bold claim that it may "in some cases eliminate the need for Dropout". We experiment with both Dropout and Batch Normalization on toy regression problems, and analyze their properties with respect to each other and to the more traditional L2 regularization. We show that while all three methods claim to act as regularization, they do not act in the same way, and it may be beneficial to use all three when the data is sufficiently noisy.

We are not tackling the model selection problem, but simply hypothesizing how Dropout and Batch Normalization affect the learning process of neural networks.

Background

artificial neural networks, regression, probability

Method Descriptions

This section introduces the three regularization methods we will be discussing. The traditional L2 regularization is a simple modification to the loss function, which consequently changes the gradients of the network. Dropout and batch normalization, by contrast, are changes to the network architecture itself.

Dropout and Batch Normalization can both be implemented as transfer or activation functions. That is, they are element-wise operations on the network units (both hidden and visible).

Dropout

For a unit $x$ , the Dropout operator is

D(x) := \begin{cases} x & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}

That is, each hidden unit is "dropped" (set to zero) with probability $1 - p$ , where $p$ is a tunable hyperparameter giving the expected fraction of units that are retained. The output of the dropout operator is random and changes for each mini-batch during the course of training. Since only retained units are passed on in forward propagation, only the gradients of the retained units are passed back during backpropagation. (The other gradients are zero.)

Applying dropout to a neural network once amounts to sampling a "thinned" version of the network. As such, training a network with dropout can be interpreted as training many thinned networks simultaneously. Since every unit is retained with probability $p$ during training, the outgoing weights of that unit are multiplied by $p$ at test time.
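The operator and its test-time counterpart can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used in the experiments: the function names are made up here, and scaling a unit's output by $p$ is used as a stand-in for scaling its outgoing weights by $p$ (the two are equivalent for a following linear layer).

```python
import numpy as np

def dropout_train(x, p, rng):
    """Keep each unit with probability p (the retain rate) and zero the rest."""
    mask = rng.random(x.shape) < p   # True where the unit is retained
    return x * mask, mask

def dropout_test(x, p):
    """At test time nothing is dropped; scale by p to match the training-time mean."""
    return x * p

rng = np.random.default_rng(0)
x = np.ones((10000, 4))          # dummy activations, all equal to 1
p = 0.8                          # retain probability

dropped, mask = dropout_train(x, p, rng)
train_mean = dropped.mean()      # close to p, since about 80% of units survive
test_out = dropout_test(x, p)    # exactly p everywhere
```

On average the training-time activation matches the scaled test-time activation, which is the reason for the factor of $p$.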

Dropout is loosely motivated by the choice of genes in sexual reproduction. The argument is as follows: "the criterion for natural selection may not be individual fitness but rather mix-ability of genes. The ability of a set of genes to be able to work well with another random set of genes makes them more robust. Since a gene cannot rely on a large set of partners to be present at all times, it must learn to do something useful on its own." ^[1]

Dropout can be seen as a process of reducing co-adaptation among the network units. It aims to make each hidden unit more robust in the sense that it can create useful features on its own rather than adapting itself to other hidden units.

Batch Normalization

For a unit $u$ , the BN operator is

\mathrm{BN}_{γ, β}(u) := γ \left( \frac{u - E[u]}{\sqrt{\mathrm{Var}(u)}} \right) + β

Where $γ, β$ are parameters to be learned. That is, the hidden unit is standardized with respect to its own distribution, then multiplied by $γ$ and displaced by $β$ . A unit takes a different value for each training example; the mean and variance of these values are denoted $E[u]$ and $\mathrm{Var}(u)$ respectively. Batch normalization aims at normalizing a unit's distribution, and claims that this can both normalize the gradients and speed up training. After normalization, $γ$ and $β$ act as fail-safe parameters whereby setting $γ = \sqrt{\mathrm{Var}(u)}$ and $β = E[u]$ reduces the BN operator to an identity. Thus, if it is optimal to not normalize the unit, the network can learn the corresponding values for $γ, β$ to essentially remove the BN operator's influence.
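As a small sanity check of the formula, the following NumPy sketch applies the operator to synthetic unit values and confirms the fail-safe property. The function name and the synthetic data are assumptions made for illustration; practical implementations also add a small constant to the variance before taking the square root, which is omitted here to mirror the formula.

```python
import numpy as np

def batch_norm(u, gamma, beta):
    """Standardize each unit over the example axis, then scale and shift."""
    mean = u.mean(axis=0)
    var = u.var(axis=0)
    return gamma * (u - mean) / np.sqrt(var) + beta

rng = np.random.default_rng(0)
u = rng.normal(loc=3.0, scale=2.0, size=(500, 3))   # 500 examples, 3 units

# gamma = 1, beta = 0 gives zero mean and unit variance per unit.
standardized = batch_norm(u, gamma=1.0, beta=0.0)

# gamma = sqrt(Var(u)), beta = E[u] undoes the normalization.
identity = batch_norm(u, gamma=np.sqrt(u.var(axis=0)), beta=u.mean(axis=0))
```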

Batch normalization was invented with a different purpose than dropout. While dropout simply claims to combat overfitting, batch normalization aims at reducing the effect of changes in a hidden unit's distribution during training. The authors of batch normalization claim that a changing distribution of the hidden units is a problem because the layers need to continuously adapt to the new distribution, which can slow down training. Batch normalization effectively normalizes the distribution, so it does not change as much during training.

BN - Approximation

Neural networks are often trained using mini-batch stochastic gradient descent because using every training sample for each update is infeasible (too time consuming). Thus the values of $E[u]$ and $\mathrm{Var}(u)$ are not known while processing a mini-batch. To combat this problem, the mean and variance are replaced with the mini-batch sample mean and sample variance. This adds extra randomness to the BN operator, since the estimates of $E[u]$ and $\mathrm{Var}(u)$ may change from one mini-batch to the next. In fact, the value of $\mathrm{BN}(u)$ for a single training sample now also depends on the other examples in the mini-batch, so the output can change drastically when the sample appears in different mini-batches.
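The dependence on the rest of the mini-batch is easy to see with a tiny example. In this sketch (hypothetical function name, made-up numbers, and an assumed small `eps` for numerical safety) the same sample value of 1.0 is normalized inside two different mini-batches and ends up on opposite sides of zero.

```python
import numpy as np

def batch_norm_minibatch(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """BN using the mini-batch sample mean and variance (eps is an assumed constant)."""
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    return gamma * (batch - mean) / np.sqrt(var + eps) + beta

# The same sample (value 1.0, first row) placed in two different mini-batches.
batch_a = np.array([[1.0], [2.0], [3.0]])    # here 1.0 is the smallest value
batch_b = np.array([[1.0], [0.0], [-1.0]])   # here 1.0 is the largest value

out_a = batch_norm_minibatch(batch_a)[0, 0]  # negative: below the batch mean
out_b = batch_norm_minibatch(batch_b)[0, 0]  # positive: above the batch mean
```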

L2 Regularization

The more traditional L2 regularization does not change the neural network architecture explicitly and is a more general form of regularization that acts on the loss function. Specifically, for any function to be learned $\hat{f}(x; θ)$ that takes input $x$ and has learnable parameters $θ$ , an extra additive term is included in the loss function.

\min_{θ} L(\hat{f}(x; θ), y) + \frac{λ}{2} \|θ\|_{2}^{2}

Where $L(\tilde{y}, y)$ is some loss function that measures how far a prediction $\tilde{y}$ is from the target $y$ . The term "L2" is due to the extra term being the squared 2-norm of the parameters $θ$ .

L2 regularization aims at reducing the magnitude of the parameters, and does not explicitly operate on the network units.
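To make the effect on the gradients concrete, here is a minimal sketch for a linear model with a squared-error loss. Both the model and the choice of loss are assumptions made for illustration, since the objective above is stated for a general $\hat{f}$ and $L$. The penalty contributes the extra term $λθ$ to the gradient, which pulls every parameter toward zero.

```python
import numpy as np

def l2_penalized_loss(theta, X, y, lam):
    """0.5 * mean squared error plus (lam / 2) * ||theta||_2^2 for the model X @ theta."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2) + 0.5 * lam * np.sum(theta ** 2)

def l2_penalized_grad(theta, X, y, lam):
    """Gradient of the loss above; the penalty adds lam * theta."""
    residual = X @ theta - y
    return X.T @ residual / len(y) + lam * theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
theta = rng.normal(size=3)
lam = 0.1                      # hypothetical regularization strength

grad = l2_penalized_grad(theta, X, y, lam)
grad_without_penalty = l2_penalized_grad(theta, X, y, 0.0)
```

The difference `grad - grad_without_penalty` equals `lam * theta`, which is why L2 regularization is often described as weight decay.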

Experiments

Figure: Visualization of the dataset used for the toy regression problem.

Results for these regularization methods on various datasets such as MNIST, CIFAR-10, etc. are widely known ^[3]. However, rather than just looking at the best result of each method, we also wish to study the behavior of each method using other criteria: ease of use, effect on learning, effect on function approximation, and robustness against randomization. To study these, we begin with a toy regression problem, where the pros and cons of each method can be easily analyzed. Then we experiment with an autoencoder and visually analyze the difference in the features learned using dropout and batch normalization. The experiments are done using Torch7 ^[4].

(Note that the Dropout hyperparameter p refers to the drop rate instead of the survival rate in these experiments.)

Experiment 1: Toy Regression

We begin with a simple 1D regression problem

y = x^{3} + 𝒩 (0, 16)

The dataset consists of 200 samples, with 100 samples in the training set and 100 samples in the test set.

For each sample, a polynomial basis of degree 4 is given as the input to the neural network. That is, the input for each sample is the vector $[x, x^{2}, x^{3}, x^{4}]$ .
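A dataset of this shape could be generated roughly as follows. Several details are assumptions, because they are not stated above: the range of $x$ , the reading of $𝒩(0, 16)$ as variance 16 (standard deviation 4), and the random train/test split.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

x = rng.uniform(-4.0, 4.0, size=n_samples)      # assumed input range
noise = rng.normal(0.0, 4.0, size=n_samples)    # std 4, i.e. variance 16 (assumed reading)
y = x ** 3 + noise

# Polynomial basis of degree 4: one row [x, x^2, x^3, x^4] per sample.
features = np.stack([x, x ** 2, x ** 3, x ** 4], axis=1)

# Assumed random split into 100 training and 100 test samples.
perm = rng.permutation(n_samples)
train_idx, test_idx = perm[:100], perm[100:]
X_train, y_train = features[train_idx], y[train_idx]
X_test, y_test = features[test_idx], y[test_idx]
```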

The network architecture being used is a deep multilayer perceptron with 5 hidden layers, 16 hidden units per layer, and ReLU activation functions. We use a complex network for a simple regression in order to force overfitting.

The Dropout and Batch Normalization operators are placed as transfer/activation layers after each linear transformation but before the ReLU activation.
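The architecture and the placement of the operators can be sketched as a plain NumPy forward pass. The experiments themselves use Torch7; this sketch only mirrors the layer ordering, and the initialization scheme, the use of bias terms, and the `transfer` hook are assumptions made for illustration.

```python
import numpy as np

def init_mlp(n_in, n_hidden=16, n_layers=5, n_out=1, rng=None):
    """Create (weight, bias) pairs for 5 hidden layers of 16 units plus an output layer."""
    if rng is None:
        rng = np.random.default_rng(0)
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    return [(rng.normal(0.0, np.sqrt(2.0 / a), size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x, transfer=None):
    """Linear -> optional transfer (Dropout or BN slot) -> ReLU, for every hidden layer."""
    h = x
    for W, b in params[:-1]:
        h = h @ W + b
        if transfer is not None:
            h = transfer(h)            # applied after the linear map, before the ReLU
        h = np.maximum(h, 0.0)         # ReLU
    W, b = params[-1]
    return h @ W + b                   # linear output for regression

params = init_mlp(n_in=4)
n_params = sum(W.size + b.size for W, b in params)   # 1185 when biases are included
out = forward(params, np.ones((8, 4)))
```

With over a thousand parameters and only 100 training samples, the network has ample capacity to overfit, which is the intent of the setup.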

The network is trained using stochastic gradient descent with momentum.^[5]
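For reference, one common form of the momentum update is sketched below on a toy quadratic objective. The learning rate, the momentum coefficient, and the objective are all hypothetical; the values used in the experiments are not stated here.

```python
import numpy as np

def sgd_momentum_step(theta, velocity, grad, lr, momentum=0.9):
    """Classical momentum: accumulate a decaying sum of past gradients, then step."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity

theta = np.array([5.0, -3.0])
velocity = np.zeros_like(theta)
for _ in range(200):
    grad = theta                      # gradient of 0.5 * ||theta||^2
    theta, velocity = sgd_momentum_step(theta, velocity, grad, lr=0.1)
```

After the loop, `theta` is close to the minimizer at the origin.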

Below for each method, we present two graphs:

The first graph plots the resulting function approximation learned by the network. This visualizes how the network is adapting due to the regularization.
The second graph plots the training and test errors as the network is being trained. Each network is trained for 1000 epochs. At certain epochs, the learning rate is reduced by a factor of 4. This effect can be seen where the fluctuations of the error change in magnitude.
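A step schedule of this kind might look like the sketch below. The base learning rate and the epochs at which the drops happen are hypothetical placeholders; two drops are used only because that would give an overall reduction by a factor of 16.

```python
def step_lr(base_lr, epoch, milestones, factor=4.0):
    """Divide the learning rate by `factor` once for every milestone already reached."""
    n_drops = sum(epoch >= m for m in milestones)
    return base_lr / factor ** n_drops

base_lr = 0.01             # hypothetical starting learning rate
milestones = (400, 700)    # hypothetical epochs at which the rate is reduced

lrs = [step_lr(base_lr, epoch, milestones) for epoch in range(1000)]
```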

Baseline Network:

The network slightly overfits the training dataset, as is evident from the zigs and zags of the function approximation.
The network shows signs of overfitting in the second graph, as the training error is decreasing steadily but the test error is increasing at the same time.

Dropout:

Dropout results in a function approximation that is simple, but does not rely on the input $x^{3}$ .
Dropout seems to be taking advantage of the ReLU activations to form a step-like function approximation. This is likely due to the network realizing it cannot simply rely on the input $x^{3}$ due to Dropout operators in the network.

Batch Normalization:

Batch normalization very clearly reduces the magnitude of the perturbations of the error compared to other methods.
Batch normalization achieves the best result.

L2 Regularization:

L2 regularization in this case is greatly motivated by the Gaussian noise in the outputs.
L2 regularization performs very well, plausibly because the data are generated as a deterministic mean, which can be computed exactly from the input $x^{3}$ , plus Gaussian noise, so that the distribution of $y$ given $x$ is Gaussian.
L2 regularization appears to rely on the input $x^{3}$ the most compared to other methods. (The function approximation has the characteristic of $x^{3}$ .)

Variant - Toy Regression with Noisy Features

In this variant, the input is augmented with an additional noise feature that is sampled from a Normal(0,1) distribution. This feature has absolutely no correlation with the label. However, due to the excessive complexity of the network, the network is able to use the extra input to create features that fit the training data extremely well. The purpose of this variant is to force overfitting and to observe how each of the regularization methods adapts to noisy features.

The input to the network is now $[x, x^{2}, x^{3}, x^{4}, ϵ]$ where $ϵ \sim N(0, 1)$ . This effectively increases the input layer by 1 unit and adds 16 extra parameters (from this extra unit to each of the hidden units on the first layer).
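The augmentation amounts to appending one random column to the design matrix. The sketch below assumes an input range for $x$ that is not given in the text, and simply counts the additional first-layer weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-4.0, 4.0, size=100)                 # assumed input range
clean = np.stack([x, x ** 2, x ** 3, x ** 4], axis=1)

eps = rng.normal(0.0, 1.0, size=(100, 1))            # noise feature, independent of the label
noisy = np.hstack([clean, eps])                      # rows are [x, x^2, x^3, x^4, eps]

hidden_units = 16
extra_weights = (noisy.shape[1] - clean.shape[1]) * hidden_units   # 1 * 16 = 16
```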

Baseline Network:

The network quickly overfits. The test error shoots up very high very fast.
The function approximation has extreme fluctuations. The outliers at each end have high influence on the network.

Dropout:

Dropout does still overfit, as the test error is increasing slightly as training progresses.
Dropout reduces the function approximation's fluctuations to within the noise, and the outliers no longer have much impact.

Batch Normalization:

Batch normalization performs the best.
Batch normalization is able to find the local optimum very quickly.

L2 Regularization:

L2 regularization performs similarly to Dropout.

Experiment 2: Autoencoder

The second experiment is to construct an autoencoder for the MNIST^[6] dataset.

The MNIST dataset contains 28x28 greyscale images of handwritten digits.

An autoencoder has two components: an encoder and a decoder. The encoder takes an image and extracts an embedding of smaller size. The decoder then takes this embedding and attempts to reconstruct the original image. The autoencoder is trained to minimize the reconstruction error.

Our experiment is to train a simple autoencoder where the encoder is a function

z = σ (W_{e} x)

And the decoder is

\tilde{x} = W_{d} z

Where $z$ is the embedding generated from an image $x$ , $\tilde{x}$ is the reconstructed image, and $σ$ is some nonlinear activation function.

We implement this as a neural network with one hidden layer for $z$ with 256 hidden units and ReLU activation functions.
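The two equations translate directly into a forward pass. This sketch uses randomly initialized weights and a random stand-in image rather than trained weights and MNIST data, and it assumes mean squared error as the reconstruction error, which the text does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_embed = 28 * 28, 256

W_e = rng.normal(0.0, 0.01, size=(n_embed, n_pixels))   # encoder weights
W_d = rng.normal(0.0, 0.01, size=(n_pixels, n_embed))   # decoder weights

def encode(x):
    """z = sigma(W_e x) with sigma = ReLU."""
    return np.maximum(W_e @ x, 0.0)

def decode(z):
    """x_tilde = W_d z."""
    return W_d @ z

def reconstruction_error(x):
    """Mean squared error between the image and its reconstruction (assumed loss)."""
    return np.mean((decode(encode(x)) - x) ** 2)

x = rng.random(n_pixels)          # stand-in for a flattened 28x28 image
z = encode(x)
x_tilde = decode(z)
err = reconstruction_error(x)
```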

Typically, L1 and L2 regularizations are not added to the loss because we don't want to smooth out the reconstruction. Instead, we experiment with dropout. Batch normalization is not included as it did not yield anything interesting. The reader is encouraged to run the code themselves to experiment.

The network is trained using stochastic gradient descent with momentum. ^[5]

For each experiment below, we present two images:

Samples taken from the test set and their reconstructed versions.
Features of the encoder ( $W_{e}$ ), which show what each unit of the embedding is generated from.

Baseline Network:

The baseline method without any regularization is the best at minimizing the reconstruction error.
However, we see that the features of the encoder are not recognizable by humans and simply appear as random noise.

Dropout (p=0.5) on the input:

Randomly dropping out the input units ( $x$ ) is equivalent to what is known as a denoising autoencoder. A denoising autoencoder takes a perturbed (noisy) image, and tries to reconstruct the original version.
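The corruption step can be sketched as follows, using the drop-rate convention of the experiments. The function name and the random stand-in image are assumptions; the key point is that the network sees the corrupted image while the loss is measured against the clean one.

```python
import numpy as np

def corrupt(x, drop_rate, rng):
    """Zero each input pixel independently with probability drop_rate."""
    keep = rng.random(x.shape) >= drop_rate
    return x * keep

rng = np.random.default_rng(0)
image = rng.random((28, 28))                 # stand-in for an MNIST image
noisy_image = corrupt(image, 0.5, rng)
kept_fraction = np.mean(noisy_image != 0)    # roughly half of the pixels survive

# Training pair for a denoising autoencoder: input is noisy_image, target is image.
```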

Having only access to 50% of the original image on average, this denoising autoencoder is still able to reconstruct the image at an amazing level.
Some parts of the reconstructions appear blurrier than those of the baseline reconstructions.
The features are more recognizable as different strokes, or even most of a number.
Many of the features still appear as random noise, or are simply not used (all black).

Dropout (p=0.5) on the embedding:

The reconstructions are even more blurred. This is expected since the decoder only has access to 50% of the embeddings on average.
The features are clearly varying types of strokes.
Almost all of the features are now used.
What appeared as random noise in the other models has been greatly reduced. This is possibly because the noisy features had co-adapted to each other. Dropout on the hidden layer forces them not to co-adapt.

Conclusion

We've constructed experiments to show how Dropout and Batch Normalization can affect the learning of neural networks. We may not be able to draw conclusions about their effectiveness as pure regularization methods that combat overfitting, but we can point to some neat side effects of applying these techniques:

Dropout has the property that it increases the fluctuation magnitude of the error, likely due to it only using a portion of the network at a time.
Dropout performs similarly to L2 regularization in the toy regression problem. However, it is easier to tune than L2 regularization because its hyperparameter spans only [0,1], while L2 regularization's hyperparameter can be any non-negative real number.
Dropout has the property of making the features very visually appealing and easy to understand.

Batch Normalization performs spectacularly. The inclusion of batch normalization makes the training much more efficient, and gradient descent is able to find a good solution much faster than other methods.
Batch Normalization has the wonderful property of making gradient descent far less sensitive to the learning rate. We see that without BN, the magnitude of the error fluctuations varied with the learning rate, whereas with BN, the fluctuations stayed about the same even when the learning rate was decreased by a factor of 16.

BN - Caution

There is, however, a slight caveat to Batch Normalization. The BN operator divides by $\sqrt{\mathrm{Var}(u)}$ , and if $\mathrm{Var}(u)$ is too small, the BN operator may amplify round-off error through this division and the subsequent multiplication by $γ$ . Depending on the floating point arithmetic implementation, different machines may get vastly different results from using BN.
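The sensitivity to small variances can be illustrated with a contrived example. This sketch does not reproduce the machine-to-machine discrepancy reported below; it only shows, under made-up values, how differences at the level of floating point rounding are blown up to order one by the normalization. The three values differ by about $10^{-8}$ , which double precision can represent but single precision cannot.

```python
import numpy as np

def standardize(u):
    """(u - mean) / sqrt(var), with no epsilon, mirroring the BN formula."""
    with np.errstate(invalid="ignore", divide="ignore"):
        return (u - u.mean()) / np.sqrt(u.var())

u64 = np.array([1.0, 1.0 + 1e-8, 1.0 - 1e-8])   # tiny but real differences
u32 = u64.astype(np.float32)                     # all three round to exactly 1.0

out64 = standardize(u64)   # differences of 1e-8 become values of order 1
out32 = standardize(u32)   # variance is exactly 0, so the result is 0 / 0 = nan
```

Practical implementations add a small constant to the variance to avoid the division by zero, but the amplification of tiny differences remains whenever the variance is very small.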

Below are results running the exact same code on the toy regression with noisy features and batch normalization, on two different machines.

Intel(R) Core(TM) i7-5930K CPU:

Intel(R) Core(TM) i5-3317U CPU:

The software packages used are the exact same version and the random numbers generated are also verified to be the same. Even with different seeds, the two machines achieve similar results (good results on the i7 desktop; bad results on the i5 laptop), indicating that randomization (in both weight initialization and mini-batching) is not the culprit.

The difference seems to arise due to floating point arithmetic, though this is only an educated guess.

Future Work

DropConnect^[7] drops the weights instead of units. It may be helpful to learn how this affects learning for neural networks.

References

[1] Srivastava, Nitish, et al. "Dropout: A simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research 15.1 (2014): 1929-1958.
[2] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).
[3] Rodrigo Benenson. State of the art results on popular datasets.
[4] https://github.com/tqichen/nn-regularization-experiments
[5] Bottou, Léon. "Stochastic learning." Advanced lectures on machine learning. Springer Berlin Heidelberg, 2004. 146-168.
[6] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.
[7] Wan, Li, et al. "Regularization of neural networks using dropconnect." Proceedings of the 30th International Conference on Machine Learning (ICML-13). 2013.
