Jump to content

Course:CPSC522/Sequential Monte Carlo samplers

From UBC Wiki

Sequential Monte Carlo samplers

This page first discusses the Sequential Monte Carlo technique[1] for performing "online" Bayesian inference - inference performed continuously as observations are made. We then describe Sequential Monte Carlo samplers[2], which build on this technique to develop an algorithm for efficient "offline" inference.

Principal Author: William Harvey
Collaborators:

Abstract

Sequential Monte Carlo allows for fast online inference by representing the posterior distribution at each step with a set of samples. Since each distribution in the sequence is typically closely related to the previous distribution, it is possible to use the set of samples from the posterior at time t as a starting point for estimating the posterior at time t+1. This is the key to the Sequential Monte Carlo approach. A more recent development concerns the use of this algorithm for offline inference: when we wish to approximately sample from a single distribution. By defining this distribution as simply the last distribution in a sequence, it is possible to leverage Sequential Monte Carlo as a powerful and general-purpose inference algorithm.

Builds on

Probability is fundamental as background for Sequential Monte Carlo (SMC). Building on probability, Markov chain Monte Carlo methods are used as an important building block of the SMC algorithm. Also, note that SMC is a generalisation of particle filtering (which considers only state-space models) to any dynamic system.

Related Pages

As a powerful general method for Bayesian inference, SMC can be used in models such as HMMs.

Sequential Monte Carlo for Online Inference

Sequential Monte Carlo[1] (SMC) has been developed as a general method for inference in models where information is obtained sequentially. It allows a posterior distribution over unobserved variables to be estimated as the information is obtained, and efficiently updated for each new observation. As with a particle filter, the core idea is to represent the posterior with a set of approximate samples from it. This idea allows the representation of arbitrarily complex and multi-modal posteriors over the states of systems with arbitrarily complex dynamics. It is defined for systems which can be specified as follows:

Model Definition

The model can be described as a sequence of probability distributions πt(𝐱t) for t=0,1,2, Each of these refers to the posterior distribution over the latent variables, 𝐱t, given the observations made up until time t. Note that the dimension of 𝐱t can change with t (as in the example below, where it increases with t to allow it to store all states 1,,t).

State-space example

To give a concrete example of such a model fitting the above definition, we now describe a state-space model. This has a d-dimensional latent state, xt, which evolves over time with known stochastic dynamics, p(xt+1|xt). Observations of the system, yt, are made according to p(yt|xt), and used to condition the estimate of x. x is initially distributed according to p(x0). We define 𝐱t to be the set of all xi for i=0,1,,t, so that x𝐭 essentially contains the "history" of the particle. Note that this is a special case and, in SMC in general, x𝐭does not "remember" all previous states. Similarly, we denote 𝐲t to be the set of all yi for i=0,1,,t. Note that this means that the dimension of 𝐱t increases by d with each increase in t. This gives the following expression for the joint distribution of 𝐱t and y1:t:

p(𝐱t,𝐲t)=1Zp(x0)i=1tp(yi|xi)p(xi|xi1)

Since the posterior πt(𝐱t) is proportional to the joint probability (down to a normalising constant independent of 𝐱t), we can write it as:

πt(𝐱t)=1Zp(x0)i=1tp(yi|xi)p(xi|xi1)

In the case that each distribution is Gaussian, with variance independent of xt, and the means of p(yt|xt) and p(xt+1|xt) are affine transformations of xt, it is possible to calculate this posterior analytically with a Kalman filter. When these conditions are not met, SMC can be used to instead compute set of samples, with which expectations over the posterior can be approximated. Note that, in this example, SMC is equivalent to using a particle filter.

Inference

SMC involves performing inference by maintaining a set of m samples, 𝐱t(j) for j=0,,m, which approximate expectations over πt(𝐱t). At the beginning of inference, these are initialised, e.g. by sampling from a prior before making any observations. Thereafter, they are updated at each time-step to approximate πt+1(𝐱t+1) rather than πt(𝐱t). In this section, we discuss two different procedures for updating the samples, although many more exist. A rough outline of the procedure is as follows:

  1. Intialise samples (e.g. by sampling from a prior over the initial state)
  2. for each timestep:
    1. transition approximate samples of πt1(𝐱t1) to samples of πt(𝐱t)
    2. optional: resample πt(𝐱t) to increase effective sample size

Sequential Importance Sampling

For the discussion of sequential importance sampling (SIS), we will consider the case where 𝐱t grows in dimension with t, and so 𝐱t+1 can be written as {xt+1,x𝐭}. Liu et al.[1] advocate sampling each xt+1(j) from p(xt+1|𝐲t+1𝐱t(j)) for each j, where 𝐱t(j) is the jth sample from πt(𝐱t). This yields samples 𝐱t+1(j)={xt+1(j),x𝐭(j)} from the distribution:

πt(𝐱t)p(xt+1|𝐲t+1,𝐱t)

However, we want samples which approximate the distribution:

πt+1(𝐱t+1)=πt+1(𝐱t)p(xt+1|𝐲t+1,𝐱t)

Intuitively, our samples do not condition any of 𝐱t(j) on the new observation, yt+1. To correct for this, the samples are weighted. By weighting samples, we can approximate expectations over a distribution, p(x), with weighted samples from q(x). The approximation is accurate in the sense that, for properly chosen weights:

𝔼p(x)[f(x)]=𝔼q(x)[w(x)f(x)]

It can be shown that the weights which give this property are given by w(x)=p(x)q(x), so can be used whenever the probability density functions of the true and approximating function are known. If expectations are normalised by the expected value of the weights, the weights can be calculated using unnormalised densities:

𝔼p(x)[f(x)]=𝔼q(x)[w(x)f(x)]𝔼[w(x)]

Since we want to approximate πt+1(𝐱t+1) with samples from πt(𝐱t)p(xt+1|𝐲t+1,𝐱t), we must weight them by:

w(𝐱t+1)=πt+1(𝐱t+1)πt(𝐱t)p(xt+1|𝐲t+1,𝐱t)=πt+1(𝐱t)p(xt+1|𝐲t+1,𝐱t)πt(𝐱t)p(xt+1|𝐲t+1,𝐱t)=πt+1(𝐱t)πt(𝐱t)

We then have a set of weighted samples, {𝐱t+1(j),wt+1(j)} for j=0,,m, which approximate πt+1(𝐱t+1). These weights can be used when it is possible to sample exactly from p(xt+1|𝐲t+1𝐱t(j)). If we are not able to, we can instead sample from an approximation, q(xt+1|𝐲t+1𝐱t(j)), and calculate weights so that the approximation of πt+1(𝐱t+1) is still exact (in the limit as m). This leads to weights of the form:

w(𝐱t+1)=πt+1(𝐱t)p(xt+1|𝐲t+1,𝐱t)πt(𝐱t)q(xt+1|𝐲t+1,𝐱t)

These weights may be kept and multiplied by later weights to approximate the posterior at later timesteps. Alternatively, the posterior can be resampled (as discussed later on), allowing the weights to be discarded.

Rejection Sampling

Rejection sampling is based on a similar idea to SIS but, instead of storing weights, samples are accepted with a probability proportional to the weight. Samples are continually drawn until m have been accepted. The procedure is as follows:

  1. Sample an index, j, with probability proportional to wt(j) (or from a uniform distribution if the samples are not weighted)
  2. Draw xt+1(j) from g(xt+1|𝐱t(j),𝐲t+1), which should approximate πt+1(xt+1|𝐱t(j)) as well as possible for efficiency
  3. Accept (j, xt+1(j)) with probability p=πt+1(𝐱t(j),xt+1)ct+1πt(𝐱t(j))g(xt+1|𝐱t,𝐲t+1) where ct+1 is a constant chosen to ensure that p1.

This procedure is performed repeatedly until m samples have been accepted. Variations of this method exist which can improve efficiency (e.g. the proportion of samples of accepted) with techniques such as Rao-Blackwellisation.

Resampling

One issue with the sequential importance sampling approach (and other, related approaches), is that the variance of the weights can grow large when they are continually changed over many time steps. Therefore, although we may have m samples of πt(𝐱t), many of them may have a very small weight, and so our approximation could be dominated by very few samples (or even just one). One method to avoid this problem is to resample them from an empirical distribution given by:

π^t(𝐱t)=j=1mwt(j)j=1mwt(j)δ𝐱t(j)(𝐱t)

where δ is the Kronecker delta. This is simple to sample from by sampling from a categorical distribution over j with probabilities proportional to the weights, and maintains the property that 𝐱t(j) are exact samples from πt(𝐱t)in the limit as m. Since this distribution, can be sampled from exactly, the resulting samples do not have weights, and so the resampling step gets rid of the weights, increasing the effective sample size. This procedure effectively gets rid of samples with small weights, and duplicates samples with high weights. The duplicated samples will diverge at the next timestep, so this procedure allows the most "likely" regions of sample space to be explored more. However, the resampling procedure also introduces noise to the procedure, and so it is typically not recommended to perform it at every step. Liu et al.[1] suggest either resampling at fixed intervals (with a size specific to the problem), or checking the effective sample size (a measure of the degeneracy of the weights) after each transition, and resampling when this decreases below a threshold. Also, note that the resampling step is the simplest technique, but other techniques exist which can add less noise.

Sequential Monte Carlo samplers

As described in the previous section, SMC provides an efficient method for sampling from a sequence of related distributions, making it suitable for situations such as performing inference as observations are made. Del Moral et al.[2] suggest using SMC even when only a single distribution is to be inferred. Whereas Liu et al.[1] considered a sequence of πt(𝐱t) (and focused on the case where 𝐱tincreases in dimension with t), Del Moral et al.[2] focus on the case where we only wish to infer π(𝐱) (which we can evaluate down to a normalising constant). To allow SMC to be used, an artificial sequence of p distributions is constructed from π(𝐱). We shall denote this sequence as π1(𝐱),,πp(𝐱). Note that 𝐱 is now constant in dimension across all distributions. Several methods are discussed for creating this sequence:

  1. If π(𝐱) is a posterior distribution conditioned on p variables, y1,,p, we can construct a series of distributions by setting π1(𝐱) to the prior and then conditioning each distribution on one more observation than the previous: πn+1(𝐱)=p(𝐱|y1,,n) for n=1,,p
  2. In order to sample from an intractable distribution, π(𝐱), we could construct a "smooth" path to it from μ(𝐱), some tractable initial distribution, as follows: πn(𝐱)=μ(x)1ϕn(x)π(x)ϕn(x)where 0ϕ1ϕp=1

Other sequences are suggested if we wish to find the mode of π(𝐱) or estimate the probability of a rare event, rather than sampling from it:

  1. To converge to the mode of π(𝐱), we can use the sequence πn=π(x)ϕn where ϕn as n increases
  2. If π(𝐱) is simple to sample from and we wish to estimate π(A)=(𝐱A) for a rare event, A, such that π(A)0, we can use the sequence πn(𝐱)=π(𝐱)𝟏An(𝐱) where A1An=A, and the estimate of π(A) is obtained form the normalising constant during inference.

Transition Kernels

As in Liu et al.[1], SMC samplers requires some kernel to transition from a sample from πn(𝐱) to a sample from πn+1(𝐱). In Liu et al.[1], we discussed kernels consisting of SIS and rejection sampling. However, these were only suitable for the case when the dimension of 𝐱 was increasing. Since the dimension of 𝐱 is constant in the case of SMC samplers, these are no longer suitable. For the discussion of this paper, we denote the transition kernel by Kn(𝐱,𝐱), which transitions from 𝐱, a sample from πn1, to 𝐱, a sample from πn. In order to minimise the variance of the weights, the optimal transition would be Kn(𝐱,𝐱)=πn(𝐱). However, this is typically not possible and so it is common to use a local move from 𝐱: Del Moral et al.[2] suggest using either a Markov chain Monte Carlo (MCMC) transition or an approximate Gibbs step. We will now describe these options in more detail.

Markov chain Monte Carlo:

Kn(𝐱,𝐱) may be set to be an MCMC kernel with πn(𝐱) as a stationary distribution. This means that, if 𝐱 is a sample from πn(𝐱), 𝐱 will also be a sample from πn(𝐱) (although it will not be independent of 𝐱). Since 𝐱 is a sample from πn1(𝐱) instead, 𝐱 will not be a perfect sample from πn(𝐱), but will be "close" if πn is close to πn1 (and 𝐱 will be "closer" to πn than 𝐱 was). The discrepancy is then accounted for in the weights (which are discussed later). A vast literature exists for constructing good MCMC kernels for arbitrary stationary distributions [3][4], and so this choice can take advantage of this literature to design an efficient kernel, and so approximate πn(𝐱) well.

Approximate Gibbs Step

This corresponds to the SIS and rejection sampling methods discussed above, where one variable (e.g. xn) is sampled conditioned on all the other variables (e.g. 𝐱n1). This is an exact Gibbs step if xn is sampled from p(xt+1|𝐲t+1𝐱t(j)) exactly, or an approximate step is SIS or an MCMC approach is used. Note that, since the dimension of 𝐱 is constant in an SMC sampler, the sampled value will overwrite a previous value, rather than being appended to 𝐱n1 as when this was discussed above.

Weights

Ideally, it would be possible to calculate weights for each 𝐱n(j) as πn(𝐱n(j))γn(𝐱n(j)), where γn(𝐱n(j)) is the density function of the "proposal" distribution for 𝐱n(j) given by:

γn(𝐱n(j))=γ1(𝐱1(j))n=2nKn(𝐱n1(j),𝐱n(j))d𝐱1,,d𝐱n1.

However, the marginalisation over 𝐱1,,𝐱n1 is typically intractable and so it is impossible to calculate the weights in this form. To avoid this marginalisation, we consider the samples as samples of 𝐱1,,n. from π1,,n, rather than just 𝐱n from πn. This means that the weights are given by:

w1:n(j)=π1,,n(𝐱1,,n(j))η1,,n(𝐱1,,n(j)),

where η1,,n(𝐱1,,n(j))=η1(𝐱1(j))n=2nKn(𝐱n1(j),𝐱n(j)). Since πn is a marginal of π1,,n, sampling from the joint distribution will yield samples from πn. Since it is not always possible to evaulate Kn(𝐱n1,𝐱n) (for example, the rejection probability of an MCMC kernel is usually intractable), Del Moral et al.[2] propose to express η1,,n(𝐱1,,n(j)) using artificial "backward" kernels Ln, such that:

η1,,n(𝐱1,,n(j))=ηn(𝐱n(j))n=2nLn(𝐱n1(j),𝐱n(j))

Computing Ln such that the above equation gives the "true" proposal distribution is still not necessarily tractable. However, Del Moral et al.[2] use the insight that the weights are only required to define an asymptotically correct marginal distribution over 𝐱n, and the distribution over 𝐱1:n1 is unimportant. This means that Ln need not be an accurate inverse of the transitions for the sampler to yield asymptotically consistent estimates (although the choice of Ln will affect the variance of the weights). For a given Ln, the weights of particle j is updated after step n by multiplying by:

w~(𝐱n1(j),𝐱n(j))=γn(𝐱n(j))Ln1(𝐱n1(j),𝐱n(j))γn1(𝐱n1(j))Kn(𝐱n1(j),𝐱n(j))

where γi(𝐱) is a (potentially) unnormalised density function of the ith distribution, πi(𝐱). Depending on the choice of Ln1(𝐱n1,𝐱n), Kn(𝐱n1,𝐱n) may cancel out and so this technique can be used even when Kn(𝐱n1,𝐱n) is intractable.

Algorithm

The algorithm for the SMC sampler is defined as follows:

  1. Define Ln1(xn1,xn), Kn(xn1,xn) sequence of π1(𝐱),,πp(𝐱)
  2. Initialization:
    • set n=1
    • for j=1,,m draw 𝐱1(j)η1
    • set {w1(j)}={π1(𝐱1(j))η1(𝐱1(j))} and normalise to obtain {W1(j)}
    • iterate through steps 2 and 4
  3. Resampling:
    • If effective sample size less than some threshold, resample and set Wn(j)=1m
  4. Sampling:
    • set n=n+1
    • if n>p, stop
    • for j=1,,m draw 𝐱n(j)Kn(𝐱n1(j),.)
    • set {wn(j)}={Wn1(j)γn(𝐱n(j))Ln1(𝐱n1(j),𝐱n(j))γn1(𝐱n1(j))Kn(𝐱n1(j),𝐱n(j))} and normalise to obtain {Wn(j)}

My Thoughts on the Contributions

Del Moral et al.[2] suggest that SMC can be used even when inference is only required over a single distribution, rather than a series of distributions. A very powerful application of this is for inference in well-defined models, in addition to the likes of Metropolis-Hastings and Gibbs sampling, and SMC can sometimes provide significantly improved inference compared to these methods. One potential issue is that it is not necessarily clear how best to select parameters such as the transition kernels, and how to define the sequence of distributions (which the quality of inference is presumably heavily dependent on). Despite this, I believe that the introduction of the framework is a valuable contribution.

Annotated Bibliography

  1. 1.0 1.1 1.2 1.3 1.4 1.5 1.6 Liu JS, Chen R. Sequential Monte Carlo methods for dynamic systems. Journal of the American statistical association. 1998 Sep 1;93(443):1032-44.
  2. 2.0 2.1 2.2 2.3 2.4 2.5 2.6 Del Moral P, Doucet A, Jasra A. Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006 Jun 1;68(3):411-36.
  3. Hoffman MD, Gelman A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research. 2014 Apr 1;15(1):1593-623.
  4. Haario H, Saksman E, Tamminen J. An adaptive Metropolis algorithm. Bernoulli. 2001;7(2):223-42.


Some rights reserved
Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0