# Course:CPSC522/Variational Inference

## Variational Inference

Variational inference is a probabilistic method for approximating the posterior over the latent variables in a Bayesian model. It is especially effective when the posterior distribution is unknown and existing sampling methods are intractable (exponential in computational cost). This is due to variational inference's ability to produce a relatively accurate approximation by optimizing the parameters of a well-known, predefined family of distributions (e.g. Gaussian, Bernoulli, etc.) to best estimate the unknown posterior. The method achieves reasonable accuracy at substantially better speed than its sampling-based counterparts.

Principal Author: Peyman Bateni

## Abstract

Accurate estimation of the posterior probability density is crucial to effective inference in probabilistic models. The posterior, which we will define mathematically in the next section, is the conditional distribution of the latent variables given the observations. With this probability density, one can produce point and/or interval estimates of the latent variables, allowing for classification of new examples, with wide applications in areas ranging from variational encoder-decoder models [2] to discriminative classifiers [3]. These operations using the posterior probability are referred to as inference, effectively inferring the latent variables from the observed values. While very powerful, the problem arises from the fact that the true posterior for a dataset is often unknown, and without it, statisticians have to resort to approximate inference as a means to achieve the same goals. To this end, variational inference is one of the two most prominent approaches for approximating the posterior, alongside sampling. The method allows for tractable and reasonably accurate estimation of the posterior by learning the parameters of well-defined distributions chosen in advance.

### Builds on

This page builds on Probability, Graphical Models, Markov Networks, and Weighted Model Counting.

### Related Pages

Variational inference is closely related to other pages on inference such as Variable Elimination, Markov Chain Monte Carlo, Particle Filtering, Treatment of Missing Data, and Bayesian Coresets.

## Content

### Challenges in Inference

Consider the case where a set of ${\displaystyle N}$ observed random variables (denoted ${\displaystyle {\bf {x}}}$) and a corresponding set of ${\displaystyle M}$ latent variables (denoted ${\displaystyle {\bf {z}}}$) are given, where the joint probability density ${\displaystyle p_{\theta }({\textbf {x}},{\textbf {z}})}$ is parameterized by ${\displaystyle \theta }$. Here, the generative model, otherwise known as the forward model, concerns the conditional generation of an observation given the latent value, as defined by ${\displaystyle p_{\theta }({\textbf {x}}|{\textbf {z}})}$. The inference problem, which is the main focus of this page, instead focuses on the reverse: estimating the posterior ${\displaystyle p_{\theta }({\textbf {z}}|{\textbf {x}})}$ to produce a likelihood over the latent space ${\displaystyle {\textbf {z}}}$ conditioned on the observations ${\displaystyle {\textbf {x}}}$. When the probabilistic model ${\displaystyle p_{\theta }}$ and the corresponding set of "true" parameters ${\displaystyle \theta }$ are known, inference can be done directly using the Bayesian definition of the posterior. However, as is the case in almost all circumstances, the true distribution ${\displaystyle p_{\theta }}$ is not available, and this poses a challenge in estimating the posterior ${\displaystyle p_{\theta }({\textbf {z}}|{\textbf {x}})}$ effectively.

Using the Bayesian definition, we can write the posterior as the following:

${\displaystyle p_{\theta }({\textbf {z}}|{\textbf {x}})={\frac {p_{\theta }({\textbf {x}},{\textbf {z}})}{p_{\theta }({\textbf {x}})}}}$

As shown, the joint probability can be estimated from tuples of observations and corresponding latent values. However, the denominator requires the marginal distribution of the observations, which is also referred to as the "evidence" [1]. To compute this evidence, we must integrate the latent variables ${\displaystyle {\textbf {z}}}$ out of the joint density ${\displaystyle p_{\theta }({\textbf {x}},{\textbf {z}})}$, which leads us to the following definition of the evidence:

${\displaystyle p_{\theta }({\textbf {x}})=\int p_{\theta }({\textbf {x}},{\textbf {z}}){\text{d}}\mathbf {z} }$
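For intuition, when the latent variable takes only finitely many values, the integral above reduces to a sum, and both the evidence and the posterior can be computed exactly. A minimal sketch with made-up joint probabilities:

```python
import math

# Hypothetical joint p(x, z) for one fixed observation x and a latent z
# with three states; the numbers are made up for illustration.
joint = {0: 0.10, 1: 0.25, 2: 0.15}

# Evidence: p(x) = sum_z p(x, z) (the discrete analogue of the integral).
evidence = sum(joint.values())

# Posterior by Bayes' rule: p(z | x) = p(x, z) / p(x).
posterior = {z: p_xz / evidence for z, p_xz in joint.items()}

assert math.isclose(sum(posterior.values()), 1.0)  # a valid distribution over z
```

The general difficulty, discussed next, is that the latent space is usually continuous or combinatorially large, so this direct enumeration is unavailable.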

We are unable to evaluate this evidence without explicit knowledge of the family of distributions ${\displaystyle p_{\theta }}$ belongs to, so in general the integral cannot be computed in closed form. That said, while finding the true distributional function is potentially impossible, we can approximate the true distribution using well-known density functions such as a Gaussian Mixture Model (GMM). We can then turn the parameters of the GMM into random variables themselves, to be optimized through Bayesian learning. This, however, leads us to the second major challenge in inference: dealing with density estimates that are possible to produce but exponential in computational cost (i.e., intractable). Consider the example provided by [1] where the latent space ${\displaystyle {\textbf {z}}}$ is instead approximated with a "Bayesian mixture of unit-variance univariate Gaussians". Here, ${\displaystyle K}$ Gaussian distributions with means ${\displaystyle \mu _{1},...,\mu _{K}}$ are used to define the latent space. The means themselves are drawn from a shared prior ${\displaystyle p(\mu _{k})}$, which is itself a Gaussian with mean ${\displaystyle 0}$ and variance ${\displaystyle \sigma ^{2}}$. The generative model consists of first sampling one of the ${\displaystyle K}$ categories, and then sampling from the corresponding Gaussian to form an observation. More specifically, the full hierarchy becomes:

${\displaystyle \mu _{k}\sim {\mathcal {N}}(0,\sigma ^{2}),k\in \{1,2,...,K\}}$
${\displaystyle c_{n}\sim {\text{Categorical}}(\{1/K\}_{k\in K}),n\in \{1,2,...,N\}}$
${\displaystyle x_{n}|c_{n},\mu \sim {\mathcal {N}}(c_{n}^{\top }\mu ,1),n\in \{1,2,...,N\}}$
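The hierarchy above can be sketched as a small sampler. This is an illustrative sketch (the function name `sample_dataset` is not from [1]) that stores each assignment ${\displaystyle c_{n}}$ as an integer index rather than a one-hot vector:

```python
import random

def sample_dataset(K, N, sigma, seed=0):
    """Sample from the Bayesian mixture of unit-variance univariate Gaussians.

    Follows the hierarchy above: component means mu_k ~ N(0, sigma^2),
    assignments c_n ~ Categorical(1/K, ..., 1/K), observations x_n ~ N(mu_{c_n}, 1).
    """
    rng = random.Random(seed)
    mu = [rng.gauss(0.0, sigma) for _ in range(K)]   # mu_k ~ N(0, sigma^2)
    c = [rng.randrange(K) for _ in range(N)]         # c_n ~ Categorical(1/K, ..., 1/K)
    x = [rng.gauss(mu[cn], 1.0) for cn in c]         # x_n | c_n, mu ~ N(mu_{c_n}, 1)
    return mu, c, x

mu, c, x = sample_dataset(K=3, N=10, sigma=5.0)
```

Fixing the seed makes each sampled dataset reproducible, which is convenient when comparing inference methods on the same data.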

According to this model, the latent space can be defined by the set ${\displaystyle \{\mu ,{\textbf {c}}\}}$ which gives rise to the joint probability density ${\displaystyle p({\textbf {x}},\mu ,{\textbf {c}})=p(\mu )\textstyle \prod _{i=1}^{N}\displaystyle p(c_{i})p(x_{i}|c_{i},\mu )}$. Using the definition of the evidence provided before, we can write the evidence for this model as the following:

${\displaystyle p({\textbf {x}})=\int p(\mu )\prod _{i=1}^{N}\textstyle \sum _{c_{i}=1}^{K}\displaystyle p(c_{i})p(x_{i}|c_{i},\mu ){\text{d}}\mu }$

While this integral can be evaluated, it is ${\displaystyle O(K^{N})}$ in computational time, scaling exponentially with the number of observed random variables and categories. We can potentially re-arrange the terms, exploiting the conjugacy between the Gaussian conditional and the Gaussian prior, as suggested by [1]. However, even though this may allow for tractable integration for a single assignment of variables to categories, there are still ${\displaystyle K^{N}}$ such assignments to evaluate, still requiring exponential time and hence remaining intractable overall.
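To see the ${\displaystyle K^{N}}$ blow-up concretely, the sketch below holds the means ${\displaystyle \mu }$ fixed and evaluates the sum-product by brute-force enumeration over all assignments (names such as `evidence_given_mu` are illustrative; in the full model each of the ${\displaystyle K^{N}}$ expanded terms would additionally require its own integral over ${\displaystyle \mu }$):

```python
import itertools
import math

def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def evidence_given_mu(x, mu, K):
    """Sum p(c) * p(x | c, mu) over all K^N assignments c, with mu held fixed.

    By distributivity this equals the product-of-sums form above, but the
    expanded enumeration makes the K^N term count explicit.
    """
    N = len(x)
    total, n_terms = 0.0, 0
    for c in itertools.product(range(K), repeat=N):   # all K^N assignments
        log_p = N * math.log(1.0 / K)                 # p(c) under the uniform prior
        log_p += sum(log_normal_pdf(x[i], mu[c[i]], 1.0) for i in range(N))
        total += math.exp(log_p)
        n_terms += 1
    return total, n_terms

x = [0.1, -0.2, 1.3, 2.0]                             # N = 4 toy observations
total, n_terms = evidence_given_mu(x, mu=[-1.0, 1.0], K=2)
assert n_terms == 2 ** len(x)                         # K^N terms even for tiny K, N
```

Even at ${\displaystyle K=2}$, ${\displaystyle N=4}$ the enumeration already visits 16 terms; at realistic dataset sizes the count is astronomically large.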

### Evidence Lower Bound Optimization (ELBO)

The intractability of producing the posterior ${\displaystyle p_{\theta }({\textbf {z}}|{\textbf {x}})}$ motivates work on approximating the posterior probability density as opposed to calculating it directly. Methods for such approximation fall into two categories: 1. sampling methods such as MCMC [4], Gibbs sampling [5], Metropolis-Hastings sampling [6], etc., and 2. variational inference. While sampling methods attempt to estimate the posterior using samples that eventually converge towards the true posterior, variational inference methods learn a function approximator, optimizing the density function's parameters to best approximate the true distribution. The essential idea in variational inference can therefore be summarized in the following two steps:

1. Choose a well-known family of distributions ${\displaystyle \Gamma }$ that is appropriate to the task at hand (e.g. Gaussian for continuous data, categorical for discrete data, etc.) that is parameterized by ${\displaystyle \phi }$.
2. Find the set of parameters ${\displaystyle \phi ^{*}}$ that minimizes the distance between the approximation of the posterior, namely ${\displaystyle q_{\Gamma ,\phi }({\textbf {z}}|{\textbf {x}})}$, and the true posterior ${\displaystyle p_{\theta }({\textbf {z}}|{\textbf {x}})}$, as measured by Kullback–Leibler (KL) divergence.

This definition simplifies certain aspects of variational inference. For instance, KL divergence is only one of many possible choices of distance metric. However, it exemplifies the simplicity of variational inference as a method for approximating unknown distributions, such as, in this case and often in practice, the posterior. The second step, in this formulation, can be written as the following:

${\displaystyle q_{\Gamma ,\phi ^{*}}({\textbf {z}}|{\textbf {x}})={\text{argmin}}_{\phi }KL(q_{\Gamma ,\phi }({\textbf {z}}|{\textbf {x}})||p_{\theta }({\textbf {z}}|{\textbf {x}}))}$
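As a concrete instance of the distance being minimized, when both ${\displaystyle q}$ and the target are univariate Gaussians, the KL divergence has a well-known closed form. A minimal sketch (the function name is illustrative):

```python
import math

def kl_gauss(m_q, s_q, m_p, s_p):
    """KL( N(m_q, s_q^2) || N(m_p, s_p^2) ): the closed form for univariate Gaussians."""
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2 * s_p ** 2) - 0.5

assert kl_gauss(0.0, 1.0, 0.0, 1.0) == 0.0   # identical distributions: zero divergence
assert kl_gauss(1.0, 1.0, 0.0, 1.0) > 0.0    # KL divergence is non-negative
```

In realistic models no such closed form exists for the divergence to the true posterior, which is exactly why the bound-based objectives below are needed.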

Note that from here onward, we use ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ instead of ${\displaystyle q_{\Gamma ,\phi }({\textbf {z}}|{\textbf {x}})}$ for brevity of notation, as the distribution family ${\displaystyle \Gamma }$ is a hyper-parameter and not learned by the optimization process. Using the definition of KL divergence, we can expand the optimization problem of finding the optimal ${\displaystyle \phi }$, giving us the following:

${\displaystyle q_{\phi ^{*}}({\textbf {z}}|{\textbf {x}})={\text{argmin}}_{\phi }\mathbb {E} [{\text{log}}(q_{\phi }({\textbf {z}}|{\textbf {x}}))]-\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {z}}|{\textbf {x}}))]}$
${\displaystyle q_{\phi ^{*}}({\textbf {z}}|{\textbf {x}})={\text{argmin}}_{\phi }\mathbb {E} [{\text{log}}(q_{\phi }({\textbf {z}}|{\textbf {x}}))]-\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {z}},{\textbf {x}}))]+{\text{log}}(p_{\theta }({\textbf {x}}))}$

We cannot minimize, or even compute, this objective due to the third term: ${\displaystyle {\text{log}}(p_{\theta }({\textbf {x}}))}$ depends on the evidence (i.e., the marginal ${\displaystyle p_{\theta }({\textbf {x}})}$), and as we previously showed, calculating the evidence is intractable. However, instead of minimizing the KL divergence directly, we can maximize an objective that differs from the negative KL divergence only by a constant. Because this objective also lower-bounds the log evidence, it is called the Evidence Lower Bound (ELBO), defined below:

${\displaystyle ELBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))=-KL(q_{\phi }({\textbf {z}}|{\textbf {x}})||p_{\theta }({\textbf {z}}|{\textbf {x}}))+{\text{log}}(p_{\theta }({\textbf {x}}))}$
${\displaystyle ELBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))=-[\mathbb {E} [{\text{log}}(q_{\phi }({\textbf {z}}|{\textbf {x}}))]-\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {z}},{\textbf {x}}))]+{\text{log}}(p_{\theta }({\textbf {x}}))]+{\text{log}}(p_{\theta }({\textbf {x}}))}$
${\displaystyle ELBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))=\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {z}},{\textbf {x}}))]-\mathbb {E} [{\text{log}}(q_{\phi }({\textbf {z}}|{\textbf {x}}))]}$
${\displaystyle q_{\phi ^{*}}({\textbf {z}}|{\textbf {x}})={\text{argmax}}_{\phi }ELBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))}$

Note that this new objective, the ELBO, differs from the negative KL objective only by the constant ${\displaystyle {\text{log}}(p_{\theta }({\textbf {x}}))}$. Because the evidence ${\displaystyle p_{\theta }({\textbf {x}})}$ does not depend on ${\displaystyle \phi }$, maximizing the ELBO also maximizes the negative KL divergence between the approximation ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ and the true posterior ${\displaystyle p_{\theta }({\textbf {z}}|{\textbf {x}})}$, while removing the dependency on the evidence. The objective can now be optimized using an iterative ascent algorithm such as coordinate ascent or stochastic gradient ascent to converge to the optimal parameters ${\displaystyle \phi }$ [1]. A discussion of these ascent algorithms is beyond the scope of this page, requiring a separate page of their own. That said, a pseudo-code overview [1] of the coordinate ascent optimization of the ELBO is provided for reference below. Note that in the figure, ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ and ${\displaystyle q_{\phi }({\textbf {z}})}$ are used interchangeably.

Figure from [1]. An overview of coordinate gradient ascent optimization using the ELBO objective.

Once learned, the approximated posterior ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ can then be used in place of ${\displaystyle p_{\theta }({\textbf {z}}|{\textbf {x}})}$ to perform inference about the latent variables conditioned on the observation.
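To make the pipeline concrete end-to-end, the sketch below estimates the ELBO by Monte Carlo for a toy conjugate model ${\displaystyle z\sim {\mathcal {N}}(0,1)}$, ${\displaystyle x|z\sim {\mathcal {N}}(z,1)}$, where the true posterior is known in closed form to be ${\displaystyle {\mathcal {N}}(x/2,1/2)}$ (all names here are illustrative):

```python
import math
import random

def elbo_mc(m, s, x, n_samples=20000, seed=0):
    """Monte Carlo estimate of ELBO(q) = E_q[log p(z, x)] - E_q[log q(z)]
    for the toy model z ~ N(0, 1), x | z ~ N(z, 1), with q = N(m, s^2)."""
    rng = random.Random(seed)

    def log_normal(v, mean, var):
        return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(m, s)                                  # z ~ q
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        total += log_joint - log_normal(z, m, s * s)
    return total / n_samples

x = 2.0
elbo_at_posterior = elbo_mc(1.0, math.sqrt(0.5), x)          # q set to the true posterior
elbo_at_prior = elbo_mc(0.0, 1.0, x)                         # q set to the prior
assert elbo_at_posterior > elbo_at_prior
```

When ${\displaystyle q}$ equals the true posterior, each sampled term is exactly ${\displaystyle {\text{log}}(p_{\theta }({\textbf {x}}))}$, so the ELBO attains its maximum there; any other choice of parameters yields a strictly smaller value, which is what an ascent algorithm exploits.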

### Evidence Upper Bound Optimization (EUBO)

Alternatively, it is possible to minimize an upper bound on the KL divergence between the approximation ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ and the true posterior ${\displaystyle p_{\theta }({\textbf {z}}|{\textbf {x}})}$. While less widely used than the ELBO, the EUBO also provides a tractable objective that can be optimized using an iterative descent algorithm. Specifically, the EUBO objective is as follows:

${\displaystyle EUBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))=KL(q_{\phi }({\textbf {z}}|{\textbf {x}})||p_{\theta }({\textbf {z}}|{\textbf {x}}))-{\text{log}}(p_{\theta }({\textbf {x}}))}$
${\displaystyle EUBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))=[\mathbb {E} [{\text{log}}(q_{\phi }({\textbf {z}}|{\textbf {x}}))]-\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {z}},{\textbf {x}}))]+{\text{log}}(p_{\theta }({\textbf {x}}))]-{\text{log}}(p_{\theta }({\textbf {x}}))}$
${\displaystyle EUBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))=\mathbb {E} [{\text{log}}(q_{\phi }({\textbf {z}}|{\textbf {x}}))]-\mathbb {E} [{\text{log}}(p_{\theta }({\textbf {z}},{\textbf {x}}))]}$
${\displaystyle q_{\phi ^{*}}({\textbf {z}}|{\textbf {x}})={\text{argmin}}_{\phi }EUBO(q_{\phi }({\textbf {z}}|{\textbf {x}}))}$

The EUBO differs from the true KL objective only by the constant ${\displaystyle {\text{log}}(p_{\theta }({\textbf {x}}))}$, and upper-bounds it whenever the evidence satisfies ${\displaystyle p_{\theta }({\textbf {x}})\leq 1}$. By minimizing the EUBO, we therefore also obtain an approximate distribution ${\displaystyle q_{\phi }({\textbf {z}}|{\textbf {x}})}$ to be used in place of the posterior.

### ELBO vs. EUBO

For readers interested in exploring this area, it is highly recommended to read (and if not available yet, potentially write) an entirely separate page on this topic, as the difference in behaviour of the objectives can be highly dependent on the choice of the distribution family ${\displaystyle \Gamma }$. However, to give an extremely high-level overview: the ELBO tends to underestimate variance, while the EUBO tends to overestimate it. That is, generally speaking, for mean-field families such as Gaussians, one can expect a narrower approximation than the true posterior when optimizing the ELBO, and a wider-than-true-posterior distribution when optimizing the EUBO. Again, it must be emphasized that this is a very high-level comparison of the two behaviours; given the non-convex nature of the objectives, variable results can be observed. Interested readers are strongly encouraged to consult other resources in this regard.

## Annotated Bibliography

[1] Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association 112.518 (2017): 859–877. Crossref. Web.

[2] Kingma, Diederik P., and Max Welling. “An Introduction to Variational Autoencoders.” Foundations and Trends® in Machine Learning 12.4 (2019): 307–392. Crossref. Web.

[3] Ng, Andrew & Jordan, Michael. (2002). On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes. Adv. Neural Inf. Process. Sys. 2.

[4] Robert, Christian, and George Casella. “A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data.” Statistical Science 26.1 (2011): 102–115. Crossref. Web.

[5] Walsh, B. (2004). Markov Chain Monte Carlo and Gibbs Sampling. Lecture Notes for EEB 581, version 26, April.

[6] Robert, Christian. (2015). The Metropolis–Hastings Algorithm. 10.1007/978-1-4757-4145-2_7.