# Course:CPSC522/Variational Auto-Encoders

## Variational Auto-Encoders

Paper 1: Kingma, Diederik & Welling, Max. (2014). Auto-Encoding Variational Bayes.

Paper 2: Narayanaswamy, Paige, Meent, Desmaison, Wood, Goodman, Kohli, & Torr, Philip. (2017). Learning Disentangled Representations with Semi-Supervised Deep Generative Models.

Principal Author: Peyman Bateni

## Abstract

Variational auto-encoders (often abbreviated as VAEs) define a relatively large family of graphical models used for density estimation, most commonly in unsupervised and semi-supervised settings. More specifically, given a set of observations, we would like to learn a latent space in which each observation has a corresponding latent representation that best describes and distinguishes it. In parallel, we would like to learn an encoder (i.e. an inference function that maps observations onto the latent space) and a decoder, which acts as a generative model that produces the corresponding observation for a given latent representation. While elegant and powerful, learning VAEs raises many problems concerning intractable posteriors, complex models built on neural networks, and learning from big data with comparatively limited computational resources. In this page, we will discuss two papers concerning VAEs: the first proposes the original unsupervised VAE, and the second extends that work to include semi-supervised discrete latent labels. It must be emphasized that, in an attempt to keep this page concise and informative, various details of both papers have been excluded. Interested readers should therefore strongly consider reading the papers themselves.

### Builds on

This page builds on Variational Inference, Graphical Models, Markov Networks, Variable Elimination, Markov Chain Monte Carlo, Particle Filtering, and Deterministic Auto-Encoder. It must be emphasized that while many readers may be able to navigate this page with the relevant background knowledge, it's highly recommended to still review Variational Inference for greater familiarity with the problem setup and the notation.

### Related Pages

Variational auto-encoders are closely related to other generative models such as Generative Adversarial Networks.

## Content

### Background: Auto-Encoders

Auto-encoders describe encoder-decoder models where the input to the encoder and the output of the decoder are of the same type, for example image-to-image auto-encoding, or the equivalent for text [2]. The key idea is that the encoder maps the input to a latent space and the decoder reconstructs that input from the latent embedding. Variational auto-encoders, as explained in detail in the next section, extend this idea by representing the latent embedding probabilistically, in the form of a variational distribution.
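To make the encoder-decoder structure concrete, here is a minimal sketch of a deterministic auto-encoder built from two randomly initialised linear maps. The dimensions, weights, and function names are hypothetical, chosen purely for illustration; a real auto-encoder would train these weights to minimise the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 8-dimensional observations, 2-dimensional latent code.
D, H = 8, 2

# Randomly initialised (untrained) weights for a linear encoder/decoder pair.
W_enc = rng.normal(size=(H, D)) * 0.1
W_dec = rng.normal(size=(D, H)) * 0.1

def encode(x):
    """Map an observation x onto the latent space."""
    return W_enc @ x

def decode(z):
    """Reconstruct an observation from its latent embedding z."""
    return W_dec @ z

x = rng.normal(size=D)
x_hat = decode(encode(x))                # reconstruction of x, same shape as x
loss = float(np.mean((x - x_hat) ** 2))  # squared reconstruction error to minimise
```

Training would adjust `W_enc` and `W_dec` by gradient descent on `loss`; a VAE replaces the deterministic code `encode(x)` with the parameters of a distribution over the latent space.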

Figure 1: Overview of a variational auto-encoder operating on image data. Note that here ${\displaystyle g_{\phi }}$ and ${\displaystyle f_{\theta }}$ correspond to ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ and ${\displaystyle p_{\theta }({\textbf {x}}|{\textbf {z}})}$ respectively. Figure is from https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html.

### Paper 1: Auto-Encoding Variational Bayes

Consider the following problem set-up borrowed from Variational Inference "where a set of ${\displaystyle N}$ observed random variables (denoted as ${\displaystyle {\bf {x}}}$) and a corresponding set of ${\displaystyle M}$ latent variables (denoted by ${\displaystyle {\bf {z}}}$) are given where the joint probability density is defined by ${\displaystyle p_{\theta }({\textbf {x}},{\textbf {z}})}$ parameterized by ${\displaystyle \theta }$." In this setup, a VAE is defined as follows: a learned latent space ${\displaystyle {\bf {z}}}$, in which latent representations describe and distinguish observations, together with the corresponding generative and inference functions (${\displaystyle f_{\theta }}$ and ${\displaystyle g_{\phi }}$), where the former generates observations given latent representations and the latter maps observations onto the latent space.

For instance, consider Figure 1 as an example of a VAE operating on image data. Here, the variational auto-encoder learns two functions that together define the latent space ${\displaystyle {\bf {z}}}$. This latent space can be described by some appropriate pre-defined distribution, such as a Gaussian in the case of image representations. The inference and generative functions consist of neural networks, forming non-linear mappings that, for ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ (denoted as ${\displaystyle g_{\phi }}$ in the figure), map from the image to the parameters of the Gaussian defining the latent representation, and, for the generative ${\displaystyle p_{\theta }({\textbf {x}}|{\textbf {z}})}$, map from such a latent representation to the pixel values of the image.

In many settings, the posterior ${\displaystyle p_{\theta }({\textbf {z}}|{\textbf {x}})}$ can be intractable to estimate, both through the closed-form Bayesian definition and through sampling methods such as MCMC [4], Gibbs sampling [5], and Metropolis-Hastings sampling [6], whose computational cost can grow prohibitively before convergence. Variational Inference provides a convenient objective, in the form of the ELBO, for producing an approximation ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ to the posterior by learning the parameters of a well-known, appropriate distributional family (such as a Gaussian for continuous densities). This works well when we parameterize the posterior approximation ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ and the corresponding generative model ${\displaystyle p_{\theta }({\textbf {x}}|{\textbf {z}})}$ using simple functions that operate on limited data. However, with the abundance of data in many problem settings today, neural networks can enhance performance as powerful non-linear function approximators.

However, in order to take advantage of this power, our VAE must be end-to-end differentiable so that the inference and generative networks can be learned jointly. Given images and their reconstructions, differentiability is key to updating parameters across the whole network. To resolve this problem, Kingma et al. use a simple yet powerful trick to form the Stochastic Gradient Variational Bayes (SGVB) estimator.

### Stochastic Gradient Variational Bayes (SGVB) estimator

Let's revisit the Evidence Lower Bound (ELBO) objective. As noted in Variational Inference, we can derive the ELBO from the KL-divergence between the true posterior and the approximation, leading to the lower bound:

${\displaystyle ELBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))=\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {z}},{\textbf {x}}))]-\mathbb {E} [{\text{log}}(q_{\phi }({\textbf {z}}|{\textbf {x}}))]}$
Through simple manipulation of the joint probability ${\displaystyle p_{\theta }({\textbf {x}},{\textbf {z}})}$ we can arrive at the following definition of the ELBO [1]:
${\displaystyle ELBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))=\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {x}}|{\textbf {z}}))]+\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {z}}))]-\mathbb {E} [{\text{log}}(q_{\phi }({\textbf {z}}|{\textbf {x}}))]}$
${\displaystyle ELBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))=\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {x}}|{\textbf {z}}))]-KL(q_{\phi }({\textbf {z}}|{\textbf {x}})||p_{\theta }({\textbf {z}}))}$
With the last line, we can now see a direct connection to training the VAE with the ELBO objective: the first term, ${\displaystyle \mathbb {E} [{\text{log}}(p_{\theta }({\textbf {x}}|{\textbf {z}}))]}$, can be seen as a reconstruction loss, while the second term, ${\displaystyle KL(q_{\phi }({\textbf {z}}|{\textbf {x}})||p_{\theta }({\textbf {z}}))}$, acts as a regularizer based on the prior ${\displaystyle p_{\theta }({\textbf {z}})}$. This prior can be thought of as placing a unit-variance Gaussian prior on the latent space, encouraging simpler parameterizations while providing a distribution to sample from during the early stages of training. The problem arises in the first term of the ELBO objective, which is obtained as follows: parameters for the latent distribution ${\displaystyle {\bf {z}}}$ are produced by the inference network ${\displaystyle g_{\phi }}$ based on the input image, a latent representation is then sampled from this distribution, and, conditioned on this latent representation, a reconstruction is produced. Where the parameters of the posterior distribution ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ are produced by the inference network ${\displaystyle g_{\phi }}$, the objective becomes:

${\displaystyle {\text{Max}}_{\phi ,\theta }[\mathbb {E} [{\text{log}}(f_{\theta }(g_{\phi }(x)))]-KL(q_{\phi }({\textbf {z}}|{\textbf {x}})||p_{\theta }({\textbf {z}}))]}$
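For the common choice of a diagonal-Gaussian posterior ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ and a standard-normal prior ${\displaystyle p_{\theta }({\textbf {z}})}$, the KL term in this objective has a well-known closed form, which can be sketched in numpy as follows (the parameter values below are hypothetical):

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ):
    0.5 * sum( sigma^2 + mu^2 - 1 - log(sigma^2) )."""
    return float(0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2)))

# When the posterior equals the prior exactly, the divergence is zero.
kl_zero = kl_diag_gaussian_to_standard_normal(np.zeros(3), np.ones(3))

# Any other posterior gives a strictly positive divergence.
kl_pos = kl_diag_gaussian_to_standard_normal(
    np.array([1.0, -0.5, 0.2]),   # hypothetical means from the inference network
    np.array([0.5, 2.0, 1.0]),    # hypothetical standard deviations
)
```

Because this term is available in closed form, only the reconstruction term of the objective needs to be estimated by sampling.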

The sampling step as described here would be stochastic but not differentiable. However, we need it to remain stochastic while becoming differentiable in order to learn the VAE end-to-end. This is resolved by the SGVB estimator. Here, instead of sampling ${\displaystyle {\bf {z}}}$ from ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ directly, we consider stochastic noise ${\displaystyle \epsilon \sim {\mathcal {N}}(0,I)}$ and a deterministic mapping function ${\displaystyle h}$ that produces a latent representation ${\displaystyle {\bf {z}}}$ from the output of ${\displaystyle g_{\phi }}$ and the stochastic noise ${\displaystyle \epsilon }$. More specifically, the objective becomes ${\displaystyle {\text{Max}}_{\phi ,\theta }[\mathbb {E} [{\text{log}}(f_{\theta }(h(g_{\phi }(x),\epsilon )))]-KL(q_{\phi }({\textbf {z}}|{\textbf {x}})||p_{\theta }({\textbf {z}}))]}$ where ${\displaystyle h}$ produces the latent representation sample. The form of ${\displaystyle h}$ depends on the distributional family chosen to represent the posterior. For instance, for a Gaussian, ${\displaystyle h(\{\mu ,\sigma \},\epsilon )=\mu +\epsilon \times \sigma }$ where ${\displaystyle \mu }$ and ${\displaystyle \sigma }$ are produced by ${\displaystyle g_{\phi }}$. This way we maintain the stochasticity of sampling while making the network end-to-end differentiable. Learning the VAE is now a matter of batch learning of the updated objective using stochastic gradient descent.
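The reparameterized sampling mapping ${\displaystyle h}$ for the Gaussian case can be sketched as follows; the Gaussian parameters below are hypothetical stand-ins for outputs of the inference network ${\displaystyle g_{\phi }}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma, eps):
    """Deterministic mapping h({mu, sigma}, eps) = mu + eps * sigma."""
    return mu + eps * sigma

# Hypothetical Gaussian parameters produced by the inference network.
mu = np.array([0.0, 1.0])
sigma = np.array([1.0, 0.5])

# All stochasticity is isolated in eps ~ N(0, I); h itself is differentiable
# in mu and sigma, so gradients can flow back into the inference network.
eps = rng.standard_normal(2)
z = reparameterize(mu, sigma, eps)

# Sanity check: samples drawn this way empirically follow N(mu, sigma^2).
samples = reparameterize(mu, sigma, rng.standard_normal((100_000, 2)))
emp_mean = samples.mean(axis=0)
emp_std = samples.std(axis=0)
```

The key design point is that the gradient of `z` with respect to `mu` and `sigma` is well defined for any fixed draw of `eps`, which is exactly what direct sampling from the posterior lacks.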

### Paper 2: Learning Disentangled Representations with Semi-Supervised Deep Generative Models

The second paper discussed in this page extends the VAE framework for disentangled learning in a semi-supervised fashion. Specifically, it adds the following set of contributions:

• Disentangling the latent embedding ${\displaystyle {\bf {z}}}$ into a discrete embedding ${\displaystyle {\textbf {y}}}$ and a continuous embedding which, to be consistent with their notation, is referred to as ${\displaystyle {\bf {z}}}$.
• Producing a semi-supervised framework such that, with few labels of ${\displaystyle {\textbf {y}}}$, the model is able to separate the label embedding into the discrete random variable while capturing continuous differences in ${\displaystyle {\bf {z}}}$. You can think of this as the case where ${\displaystyle {\textbf {y}}}$ captures digits and ${\displaystyle {\bf {z}}}$ captures the hand-writing.

Note that the latent space is learned in an unsupervised way; we never have a loss signal for ${\displaystyle {\bf {z}}}$ itself, only the reconstruction loss that comes from the ELBO. Although this latent representation is meaningful in a statistical sense, it is difficult to directly associate a particular embedding with a semantic label. However, in many cases, such as hand-written digits, while there is considerable variability in the style of hand-writing that can be captured in the latent space, we can disentangle the actual digit a hand-written example belongs to using a Categorical distribution.

This way, we are effectively augmenting our variational auto-encoder to consider the case where, for ${\displaystyle N}$ observed random variables (denoted as ${\displaystyle {\bf {x}}}$) and a corresponding set of ${\displaystyle M}$ latent variables (denoted by ${\displaystyle {\bf {z}}}$), we also have a corresponding discrete random variable ${\displaystyle {\textbf {y}}}$ that corresponds to the label for the observation. You can think of ${\displaystyle {\textbf {y}}}$ as one of the digits 0 to 9, or a medical label in a medical imaging problem, etc. When pairs of observations and labels are given, we can easily extend the previous setup to a supervised discrete VAE through several simple modifications:

1. The generative model additionally conditions on the label, that is ${\displaystyle p_{\theta }({\textbf {x}}|{\textbf {z}})}$ -> ${\displaystyle p_{\theta }({\textbf {x}}|{\textbf {z}},{\textbf {y}})}$
2. The ELBO objective becomes ${\displaystyle ELBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))=\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {x}}|{\textbf {z}},{\textbf {y}}))]-KL(q_{\phi }({\textbf {z}}|{\textbf {x}})||p_{\theta }({\textbf {z}}))}$, which we'll refer to as the supervised ELBO
3. Learning proceeds as usual
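The paper does not prescribe a particular mechanism for conditioning the generative model on ${\displaystyle {\textbf {y}}}$; one common implementation choice (an assumption here, purely for illustration) is to feed the decoder the continuous code concatenated with a one-hot encoding of the label:

```python
import numpy as np

def one_hot(y, num_classes):
    """One-hot encode an integer label y into a vector of length num_classes."""
    v = np.zeros(num_classes)
    v[y] = 1.0
    return v

def conditional_decoder_input(z, y, num_classes=10):
    """Build the input to p_theta(x | z, y) by concatenating z with one-hot(y).
    (Hypothetical helper; the paper does not fix this architecture.)"""
    return np.concatenate([z, one_hot(y, num_classes)])

# Hypothetical 2-dimensional continuous code and digit label.
z = np.array([0.3, -1.2])
x_in = conditional_decoder_input(z, y=7)  # 12-dimensional decoder input
```

The decoder network then maps this concatenated vector to the observation, so changing `y` while holding `z` fixed should change the digit identity but not the writing style.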

Additionally, as the title of the paper suggests, the work extends to the case where only observations are available. This makes immense practical sense: image data, for instance, is easy to obtain without labels, but labelling can be costly, so only a few image-label pairs may be available and the unsupervised case without labels must also be considered. This, as the paper notes, can be done effectively by introducing a second inference network designed to infer the label representation ${\displaystyle {\textbf {y}}}$. That is:

1. An additional inference network is introduced to learn the parameters of the label posterior ${\displaystyle q_{\phi }({\textbf {y}}|{\textbf {x}})}$
2. The objective becomes ${\displaystyle ELBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))=\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {x}}|{\textbf {z}},{\textbf {y}}))]-KL(q_{\phi }({\textbf {z}}|{\textbf {x}})||p_{\theta }({\textbf {z}}))-KL(q_{\phi }({\textbf {y}}|{\textbf {x}})||p_{\theta }({\textbf {y}}))}$, which we'll refer to as the unsupervised ELBO
3. Learning proceeds as usual

Furthermore, given the few observation-label pairs that may be available in the supervised case, we can additionally introduce a classification loss that adds a supervised signal for predicting the correct ${\displaystyle {\textbf {y}}}$; a prime example is the cross-entropy loss, which consists of the negative log-likelihood of a Softmax over the output. This way, we now have three specific objectives: 1. the supervised ELBO (denoted ${\displaystyle {ELBO}_{s}}$), 2. the unsupervised ELBO (denoted ${\displaystyle {ELBO}_{u}}$), and 3. the classification label loss ${\displaystyle L_{y}}$. This leads to the objective for learning a semi-supervised disentangled VAE:

${\displaystyle {\text{Max}}_{\phi ,\theta }[ELBO_{s}+ELBO_{u}-L_{y}]}$
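The classification loss ${\displaystyle L_{y}}$ for a single example, as the negative log-likelihood of a Softmax described above, can be sketched as follows (the logits are hypothetical stand-ins for the label inference network's output):

```python
import numpy as np

def cross_entropy(logits, y):
    """Negative log-likelihood of a softmax over logits for the true label y."""
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return float(-log_probs[y])

# Hypothetical logits for a 3-class problem: the network favours class 0.
logits = np.array([2.0, 0.5, -1.0])
loss_correct = cross_entropy(logits, y=0)  # confident and correct: small loss
loss_wrong = cross_entropy(logits, y=2)    # confident but wrong: large loss
```

This per-example loss, averaged over the labelled examples, supplies the supervised signal that anchors the discrete variable to the actual labels rather than to an arbitrary permutation of them.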

This objective can now be used to learn, for instance, a simple image recognition task where only a few examples are labelled. Training is done through stochastic gradient descent on the objective. Additionally, an extension to the model is proposed where, for the reconstruction loss, instead of sampling the label latent variable ${\displaystyle {\textbf {y}}}$, the objective enumerates over all its possible values. That is, ${\displaystyle \mathbb {E} [{\text{log}}(p_{\theta }({\textbf {x}}|{\textbf {z}},{\textbf {y}}))]}$ becomes ${\displaystyle \mathbb {E} [{\text{log}}(p_{\theta }({\textbf {x}}|{\textbf {z}}))]}$ where ${\displaystyle p_{\theta }({\textbf {x}}|{\textbf {z}})=\sum _{y}p_{\theta }({\textbf {x}}|{\textbf {z}},{\textbf {y}})q_{\phi }(y|x)}$. This is empirically shown to produce better gradients, leading to a performance gain both on the classification task and on the reconstruction error.
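The enumeration over the discrete label can be sketched as follows, with hypothetical per-label likelihoods and a softmax posterior standing in for the networks' outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of discrete labels, e.g. digits 0-9

# Hypothetical likelihoods p_theta(x | z, y) for each possible label y,
# and hypothetical logits from the label inference network q_phi(y | x).
p_x_given_zy = rng.uniform(0.1, 1.0, size=K)
logits = rng.normal(size=K)
q_y_given_x = np.exp(logits) / np.exp(logits).sum()  # softmax over labels

# Marginalise out y instead of sampling it:
#   p(x | z) = sum_y p(x | z, y) q(y | x)
p_x_given_z = float(np.sum(p_x_given_zy * q_y_given_x))
```

Because the sum runs over every label rather than a single sample, the resulting gradient estimate has no sampling variance in ${\displaystyle {\textbf {y}}}$, at the cost of evaluating the decoder once per label.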

Lastly, it must again be emphasized that both of these papers are extensive in scope, with various novelties, such as the label enumeration discussed in the previous paragraph, that come with complexities deserving entire pages of their own. While they have been shortened here for the sake of concision, interested readers are strongly encouraged to consult the papers themselves.

## Annotated Bibliography

[1] Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association 112.518 (2017): 859–877. Crossref. Web.

[2] Kingma, Diederik P., and Max Welling. “An Introduction to Variational Autoencoders.” Foundations and Trends® in Machine Learning 12.4 (2019): 307–392. Crossref. Web.

[3] Narayanaswamy, Paige, Meent, Desmaison, Wood, Goodman, Kohli, & Torr, Philip. (2017). Learning Disentangled Representations with Semi-Supervised Deep Generative Models.

[4] Robert, Christian, and George Casella. “A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data.” Statistical Science 26.1 (2011): 102–115. Crossref. Web.

[5] Walsh, B. (2004). Markov Chain Monte Carlo and Gibbs Sampling. Lecture Notes for EEB 581, version 26, April.

[6] Robert, Christian. (2015). The Metropolis—Hastings Algorithm. 10.1007/978-1-4757-4145-2_7.