Title
This page covers Normalizing Flows, a class of generative model which transforms a latent distribution to match the data distribution using an invertible transformation.
Principal Author: Matthew Niedoba
Collaborators:
Abstract
Normalizing Flows are an attractive class of generative models which map samples from a simple analytic distribution one-to-one to a complex data distribution through a bijective transformation. Notably, this construction allows for both sampling from the data distribution and exact likelihood computation. In this page, we introduce normalizing flows and detail their construction and application.
Builds on
An understanding of probability and probability distributions is required to understand the construction of Normalizing Flows. The bijective transforms of normalizing flows are typically parameterized by Neural Networks.
Related Pages
Normalizing Flows are just one member of the broad class of generative models. Other types of generative models include Generative Adversarial Networks, Variational Autoencoders, and Diffusion Probabilistic Models.
Method
Source and Target Distributions
Normalizing Flows model the data distribution by transforming the latent space (right) into the data space (left) through an invertible mapping. Source: Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016) [1]
The aim of normalizing flows is to model a target distribution $p_X(x)$, $x \in \mathbb{R}^D$. Since the analytic form of $p_X$ is not generally known, normalizing flows model it by transforming samples $z \sim p_Z(z)$ from a source distribution through a transformation $x = f_\theta(z)$, generally parameterized with some parameters $\theta$.
The goal of normalizing flows is to achieve a one-to-one, invertible mapping between $z$ and $x$. To this end, normalizing flows restrict the transformation $f_\theta$ to diffeomorphisms - transformations which are invertible and where both $f_\theta$ and $f_\theta^{-1}$ are differentiable. Since $f_\theta$ must be invertible, we require that $z$ must be the same dimensionality as $x$. Using the inverse transformation, we can transform data samples into samples from the source distribution, $z = f_\theta^{-1}(x)$.
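As a small illustration, the sketch below samples from a Gaussian source distribution and maps the sample into "data" space through an elementwise affine bijection, checking that the inverse recovers the latent. The affine map and its parameters are hypothetical stand-ins for a learned transformation $f_\theta$.

```python
# A minimal sketch of the normalizing flow setup: a source (latent) distribution
# and an invertible transformation f_theta mapping latent samples to data samples.
import torch

D = 2                                                              # z and x share dimensionality
p_Z = torch.distributions.Normal(torch.zeros(D), torch.ones(D))    # source distribution

log_scale = torch.tensor([0.5, -0.3])    # hypothetical parameters theta
shift = torch.tensor([1.0, 2.0])

def f(z):                                # x = f_theta(z): latent -> data
    return z * torch.exp(log_scale) + shift

def f_inv(x):                            # z = f_theta^{-1}(x): data -> latent
    return (x - shift) * torch.exp(-log_scale)

z = p_Z.sample()                         # sample from the source distribution
x = f(z)                                 # transform into a "data" sample
assert torch.allclose(f_inv(x), z)       # invertibility: f^{-1}(f(z)) = z
```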
Likelihood Computations
The previous section illustrates how to sample data from a normalizing flow by transforming samples from the source distribution through the transformation $f_\theta$. Another important task in generative modelling is computing the likelihood of data under the model. Due to the one-to-one mapping between the source and target distributions, normalizing flows allow exact computation of likelihoods through the change of variables formula

$$p_X(x) = p_Z\left(f_\theta^{-1}(x)\right)\left|\det\frac{\partial f_\theta^{-1}(x)}{\partial x}\right|$$

Here, $\det\frac{\partial f_\theta^{-1}(x)}{\partial x}$ refers to the determinant of the Jacobian of $f_\theta^{-1}$ with respect to $x$. Since $z = f_\theta^{-1}(x)$, and using the identity $\det\left(A^{-1}\right) = \det(A)^{-1}$, we can also write

$$p_X(x) = p_Z(z)\left|\det\frac{\partial f_\theta(z)}{\partial z}\right|^{-1}$$
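The sketch below illustrates the change of variables formula for the same kind of elementwise affine bijection used above: the exact log-likelihood is the source log-density of the inverted point plus the log-determinant of the inverse Jacobian. The parameters and function names are hypothetical.

```python
# A minimal sketch of exact likelihood computation via the change of variables formula:
# log p_X(x) = log p_Z(f^{-1}(x)) + log|det d f^{-1}(x)/dx|.
import torch

D = 2
p_Z = torch.distributions.Normal(torch.zeros(D), torch.ones(D))
log_scale = torch.tensor([0.5, -0.3])    # hypothetical flow parameters
shift = torch.tensor([1.0, 2.0])

def f_inv(x):                            # z = f^{-1}(x) for the elementwise affine map
    return (x - shift) * torch.exp(-log_scale)

def log_prob_x(x):
    z = f_inv(x)
    # The Jacobian of f^{-1} is diag(exp(-log_scale)), so its log-determinant is -sum(log_scale).
    log_det_inv = -log_scale.sum()
    return p_Z.log_prob(z).sum() + log_det_inv

x = torch.tensor([1.5, 1.0])
print(log_prob_x(x))                     # exact log-likelihood of x under the flow
```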
Training Objective
Like many generative models, the goal of training is to approximate the true data distribution $p_X(x)$ with our model distribution $p_\theta(x)$, which transforms samples from our source distribution into the target distribution through the parameterized transformation $f_\theta$. The objective of training is to minimize the KL divergence between these two distributions. That is, to minimize

$$\mathcal{L}(\theta) = D_{KL}\left(p_X(x) \,\|\, p_\theta(x)\right)$$

Note that this is the so-called forward KL divergence, which is an expectation over samples from the true data distribution $p_X(x)$, usually in the form of a dataset of examples. Noting that $D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\log p(x) - \log q(x)\right]$, we can rewrite the loss as the sum of two expectations

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim p_X}\left[\log p_X(x)\right] - \mathbb{E}_{x \sim p_X}\left[\log p_\theta(x)\right]$$

Since the first term is constant with respect to the parameters, we can see that minimizing the KL divergence of a normalizing flow with the data distribution is equivalent to minimizing the negative log likelihood of the data under the normalizing flow model

$$\mathcal{L}(\theta) = -\mathbb{E}_{x \sim p_X}\left[\log p_\theta(x)\right]$$

During training, we compute the exact likelihood $p_\theta(x)$ using the change of variables formula.
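A minimal sketch of this training objective is given below, assuming an elementwise affine flow with learnable scale and shift. The data, learning rate, and parameter names are placeholders; the point is that the loss is the negative log-likelihood computed through the change of variables formula.

```python
# A minimal sketch of maximum-likelihood training of a (hypothetical) affine flow.
import torch

D = 2
p_Z = torch.distributions.Normal(torch.zeros(D), torch.ones(D))
log_scale = torch.zeros(D, requires_grad=True)      # learnable parameters theta
shift = torch.zeros(D, requires_grad=True)

def log_prob_x(x):                                   # change of variables, as above
    z = (x - shift) * torch.exp(-log_scale)
    return p_Z.log_prob(z).sum(dim=-1) - log_scale.sum()

data = torch.randn(1000, D) * 2.0 + 3.0              # hypothetical training data
optimizer = torch.optim.Adam([log_scale, shift], lr=1e-2)
for step in range(500):
    loss = -log_prob_x(data).mean()                  # negative log-likelihood of the data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```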
Finite Normalizing Flows
The key challenge in designing a normalizing flow is choosing the structure of the transformation which maps between the source and target distributions. Such a transformation must be complex enough to model the data distribution. However, generating such a complex transformation in one shot is difficult, especially when it must be invertible. Instead of computing the transformation in one shot, finite normalizing flows achieve the required complexity through the composition of a finite number of simpler transformations, $f_\theta = f_K \circ f_{K-1} \circ \cdots \circ f_1$.
Each transformation $f_k$ can be thought of as a miniature normalizing flow, transforming an intermediate source distribution $p_{k-1}(z_{k-1})$ into an intermediate target distribution $p_k(z_k)$ via the relation

$$z_k = f_k(z_{k-1})$$

for $k = 1, \dots, K$. We set $z_0$ equal to the samples from our original source distribution (usually a multivariate Gaussian) and aim to have $z_K$ match the target distribution $p_X(x)$. With the finite normalizing flow construction, the determinant of the Jacobian of the overall transformation is equal to the product of the determinants of the Jacobians of each transform

$$\left|\det\frac{\partial f_\theta(z_0)}{\partial z_0}\right| = \prod_{k=1}^{K}\left|\det\frac{\partial f_k(z_{k-1})}{\partial z_{k-1}}\right|$$
By selecting individual transforms which are invertible and for which the determinant of the Jacobian can be computed efficiently, we ensure that the overall normalizing flow is also invertible with an easy to compute Jacobian determinant. In the next sections, we discuss choices of transforms which have these properties and can be composed to construct more complex normalizing flows.
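The sketch below illustrates this composition: each layer returns its output together with the log-determinant of its Jacobian, and the log-determinant of the full flow is the sum over layers (equivalently, the product of the determinants). The simple affine layer is a hypothetical stand-in for the coupling and planar transforms discussed next.

```python
# A minimal sketch of a finite normalizing flow as a composition of simple invertible layers.
import torch

class AffineLayer:
    """Hypothetical elementwise affine bijection z_k = z_{k-1} * exp(s) + t."""
    def __init__(self, dim):
        self.log_scale = torch.randn(dim) * 0.1
        self.shift = torch.randn(dim) * 0.1

    def forward(self, z):
        # Returns the transformed sample and log|det| of this layer's Jacobian.
        return z * torch.exp(self.log_scale) + self.shift, self.log_scale.sum()

layers = [AffineLayer(2) for _ in range(4)]   # f = f_4 ∘ f_3 ∘ f_2 ∘ f_1
z = torch.randn(2)                            # z_0 from the source distribution
total_log_det = 0.0
for layer in layers:
    z, log_det = layer.forward(z)
    total_log_det = total_log_det + log_det   # log|det J_f| = sum_k log|det J_{f_k}|
```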
Coupling Flows
Coupling transformations, introduced by [2], aim to make the Jacobian matrix triangular by partitioning $x$ into two parts, $x_A$ and $x_B$. Then, we define the transformation of each partition separately

$$y_A = x_A, \qquad y_B = x_B + m(x_A)$$

where $m$ is an arbitrary function, typically a neural network. Coupling transformations are easily invertible, and the Jacobian matrix has the form

$$\frac{\partial y}{\partial x} = \begin{bmatrix} I & 0 \\ \frac{\partial y_B}{\partial x_A} & I \end{bmatrix}$$

Since the Jacobian is lower triangular with ones along the diagonal, the determinant is equal to one. The partitioning of $x$ is modified with each transformation layer such that all components of $x$ are transformed by the end of the flow.
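A minimal sketch of an additive coupling layer in this spirit is given below. The partition into two halves, the hidden width, and the network $m$ are illustrative choices, not the exact architecture of [2].

```python
# A minimal sketch of an additive coupling layer: y_A = x_A, y_B = x_B + m(x_A).
# The layer is trivially invertible and its Jacobian determinant is exactly one.
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.m = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                               nn.Linear(64, dim - self.half))

    def forward(self, x):
        x_a, x_b = x[..., :self.half], x[..., self.half:]
        y_b = x_b + self.m(x_a)                  # shift one partition using the other
        return torch.cat([x_a, y_b], dim=-1)     # log|det J| = 0

    def inverse(self, y):
        y_a, y_b = y[..., :self.half], y[..., self.half:]
        x_b = y_b - self.m(y_a)                  # exact inverse, since y_A = x_A
        return torch.cat([y_a, x_b], dim=-1)
```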
RealNVP
RealNVP [1] extends the coupling transformations introduced in [2] by adding a scaling to the transformation. Specifically

$$y_A = x_A, \qquad y_B = x_B \odot \exp\left(s(x_A)\right) + t(x_A)$$

Here, the transformation is parameterized by two neural networks, $s$ and $t$, which control the scale and shift of the transformation. With this modification, the transformation is still easily invertible, and the Jacobian is still lower triangular

$$\frac{\partial y}{\partial x} = \begin{bmatrix} I & 0 \\ \frac{\partial y_B}{\partial x_A} & \mathrm{diag}\left(\exp\left(s(x_A)\right)\right) \end{bmatrix}$$

where $\mathrm{diag}$ indicates a diagonal matrix. Since the Jacobian is lower triangular, the determinant is the product of its diagonal entries, $\exp\left(\sum_j s(x_A)_j\right)$.
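Below is a minimal sketch of an affine coupling layer along these lines, with small illustrative networks for $s$ and $t$; it is not the full RealNVP architecture, which also uses masking and a multi-scale structure.

```python
# A minimal sketch of an affine coupling layer: y_A = x_A, y_B = x_B * exp(s(x_A)) + t(x_A).
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.s = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim - self.half), nn.Tanh())
        self.t = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim - self.half))

    def forward(self, x):
        x_a, x_b = x[..., :self.half], x[..., self.half:]
        s, t = self.s(x_a), self.t(x_a)
        y_b = x_b * torch.exp(s) + t
        log_det = s.sum(dim=-1)                  # log of the product of diagonal entries
        return torch.cat([x_a, y_b], dim=-1), log_det

    def inverse(self, y):
        y_a, y_b = y[..., :self.half], y[..., self.half:]
        x_b = (y_b - self.t(y_a)) * torch.exp(-self.s(y_a))
        return torch.cat([y_a, x_b], dim=-1)
```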
Planar Flows
Planar flows [3] are a type of transformation which allows for linear-time computation of the determinant of the Jacobian. They have the form

$$f(z) = z + u\, h\!\left(w^\top z + b\right)$$

where $h$ is a smooth elementwise nonlinearity. Here, the parameters $\lambda$ of the transformation are $\lambda = \{u \in \mathbb{R}^D, w \in \mathbb{R}^D, b \in \mathbb{R}\}$. The determinant of the Jacobian is easily computed as

$$\left|\det\frac{\partial f}{\partial z}\right| = \left|1 + u^\top h'\!\left(w^\top z + b\right) w\right|$$

With planar flows, $f$ and its determinant are easy to compute, but $f$ is not easily invertible. As a result, the authors of this method train their flow using the reverse KL divergence. In this setup, they minimize the divergence $D_{KL}\left(p_\theta(x) \,\|\, p_X(x)\right)$ by drawing samples from the source distribution.
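A minimal sketch of a planar flow layer follows, using $h = \tanh$. The initialization is arbitrary, and the sketch omits the reparameterization of $u$ and $w$ that [3] uses to guarantee invertibility.

```python
# A minimal sketch of a planar flow: f(z) = z + u * h(w^T z + b), with
# |det J| = |1 + u^T h'(w^T z + b) w|, which costs O(D) to evaluate.
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.1)
        self.w = nn.Parameter(torch.randn(dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                                        # z: (batch, dim)
        lin = z @ self.w + self.b                                # w^T z + b
        f_z = z + self.u * torch.tanh(lin).unsqueeze(-1)         # f(z)
        psi = (1 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w  # psi(z) = h'(w^T z + b) w
        log_det = torch.log(torch.abs(1 + psi @ self.u))         # log|1 + u^T psi(z)|
        return f_z, log_det
```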
Continuous Normalizing Flows
Synthetic celebrity portraits generated using Glow, a normalizing flow. Source: Kingma, D. P., & Dhariwal, P. (2018) [4]
Continuous normalizing flows [5] consider the case of extending finite normalizing flows to an infinite number of infinitesimal transformations. If we let $z(0)$ be a variable from our source distribution and $z(1)$ be a variable from our target distribution, then the continuous normalizing flow transforming $z(0)$ to $z(1)$ is given by the ordinary differential equation (ODE)

$$\frac{\partial z(t)}{\partial t} = f_\theta\left(z(t), t\right), \qquad z(1) = z(0) + \int_0^1 f_\theta\left(z(t), t\right)\, dt$$

The log density of the resulting distribution is given by the instantaneous change of variables formula, another ODE:

$$\frac{\partial \log p\left(z(t)\right)}{\partial t} = -\mathrm{Tr}\left(\frac{\partial f_\theta}{\partial z(t)}\right)$$

Notably, unlike finite normalizing flows, computing the log density only requires evaluating the trace of the Jacobian, instead of the determinant. This allows for more freedom in selecting $f_\theta$, but at the cost of using a numerical ODE solver for sampling and likelihood evaluation. Training continuous normalizing flows is also challenging, as it requires backpropagating through the ODE solver.
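The sketch below integrates a continuous normalizing flow with a simple fixed-step Euler scheme as a stand-in for a proper ODE solver, evolving the sample and the change in log-density jointly. The dynamics network, step count, and exact trace computation (practical only in low dimensions) are illustrative choices.

```python
# A minimal sketch of a continuous normalizing flow: Euler-integrate dz/dt = f(z, t)
# and d log p / dt = -Tr(df/dz) from t = 0 to t = 1.
import torch
import torch.nn as nn

dim = 2
# Hypothetical dynamics network f_theta(z, t); takes [z, t] and returns dz/dt.
f = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, dim))

def dynamics(z, t):
    t_col = torch.full((z.shape[0], 1), float(t))
    return f(torch.cat([z, t_col], dim=-1))

def trace_df_dz(z, t):
    """Exact trace of the Jacobian of the dynamics via autograd (D backward passes)."""
    z = z.detach().requires_grad_(True)
    out = dynamics(z, t)
    trace = torch.zeros(z.shape[0])
    for i in range(z.shape[1]):
        grad_i = torch.autograd.grad(out[:, i].sum(), z, retain_graph=True)[0]
        trace = trace + grad_i[:, i]
    return trace

z = torch.randn(8, dim)                  # z(0): samples from the source distribution
delta_logp = torch.zeros(8)
steps = 100
dt = 1.0 / steps
for k in range(steps):                   # fixed-step Euler integration
    t = k * dt
    with torch.no_grad():
        dz = dynamics(z, t)
    delta_logp = delta_logp - trace_df_dz(z, t) * dt
    z = z + dz * dt
# log p(z(1)) = log p_Z(z(0)) + delta_logp, by the instantaneous change of variables formula.
```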
Applications
Normalizing Flows have primarily been used to model image data, such as in [2][1][4]. However, other applications exist, such as text modelling [6] and audio synthesis [7]. Normalizing flows have become less popular recently, possibly because the requirements of invertibility and easy-to-compute Jacobian determinants place large restrictions on the class of transformations that can be used. Instead, many practitioners have shifted to other generative models, such as Generative Adversarial Networks, Variational Autoencoders or Diffusion Probabilistic Models. However, recent work has led to renewed interest in normalizing flows due to a simplification of the training objective for continuous normalizing flows through flow matching [8], which may be used for a new generation of powerful normalizing flow models.
Annotated Bibliography
- [1] Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
- [2] Dinh, L., Krueger, D., & Bengio, Y. (2014). NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
- [3] Rezende, D., & Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning (pp. 1530-1538). PMLR.
- [4] Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems, 31.
- [5] Chen, R. T., Rubanova, Y., Bettencourt, J., & Duvenaud, D. K. (2018). Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31.
- [6] Tran, D., Vafa, K., Agrawal, K., Dinh, L., & Poole, B. (2019). Discrete flows: Invertible generative models of discrete data. Advances in Neural Information Processing Systems, 32.
- [7] Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., ... & Hassabis, D. (2018). Parallel WaveNet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning (pp. 3918-3926). PMLR.
- [8] Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., & Le, M. (2022). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0