Course:CPSC522/Sequential Monte Carlo samplers
- 1 Sequential Monte Carlo samplers
- 2 Abstract
- 3 Sequential Monte Carlo for Online Inference
- 4 Sequential Monte Carlo samplers
- 5 My Thoughts on the Contributions
- 6 Annotated Bibliography
Sequential Monte Carlo samplers
This page first discusses the Sequential Monte Carlo technique for performing "online" Bayesian inference - inference performed continuously as observations are made. We then describe Sequential Monte Carlo samplers, which build on this technique to develop an algorithm for efficient "offline" inference.
Principal Author: William Harvey
Sequential Monte Carlo allows for fast online inference by representing the posterior distribution at each step with a set of samples. Since each distribution in the sequence is typically closely related to the previous distribution, it is possible to use the set of samples from the posterior at time as a starting point for estimating the posterior at time . This is the key to the Sequential Monte Carlo approach. A more recent development concerns the use of this algorithm for offline inference: when we wish to approximately sample from a single distribution. By defining this distribution as simply the last distribution in a sequence, it is possible to leverage Sequential Monte Carlo as a powerful and general-purpose inference algorithm.
Probability is fundamental as background for Sequential Monte Carlo (SMC). Building on probability, Markov chain Monte Carlo methods are used as an important building block of the SMC algorithm. Also, note that SMC is a generalisation of particle filtering (which considers only state-space models) to any dynamic system.
As a powerful general method for Bayesian inference, SMC can be used in models such as HMMs.
Sequential Monte Carlo for Online Inference
Sequential Monte Carlo (SMC) has been developed as a general method for inference in models where information is obtained sequentially. It allows a posterior distribution over unobserved variables to be estimated as the information is obtained, and efficiently updated for each new observation. As with a particle filter, the core idea is to represent the posterior with a set of approximate samples from it. This idea allows the representation of arbitrarily complex and multi-modal posteriors over the states of systems with arbitrarily complex dynamics. It is defined for systems which can be specified as follows:
The model can be described as a sequence of probability distributions for Each of these refers to the posterior distribution over the latent variables, , given the observations made up until time . Note that the dimension of can change with (as in the example below, where it increases with to allow it to store all states ).
To give a concrete example of such a model fitting the above definition, we now describe a state-space model. This has a -dimensional latent state, , which evolves over time with known stochastic dynamics, . Observations of the system, , are made according to , and used to condition the estimate of . is initially distributed according to . We define to be the set of all for , so that essentially contains the "history" of the particle. Note that this is a special case and, in SMC in general, does not "remember" all previous states. Similarly, we denote to be the set of all for . Note that this means that the dimension of increases by with each increase in . This gives the following expression for the joint distribution of and :
Since the posterior is proportional to the joint probability (down to a normalising constant independent of ), we can write it as:
In the case that each distribution is Gaussian, with variance independent of , and the means of and are affine transformations of , it is possible to calculate this posterior analytically with a Kalman filter. When these conditions are not met, SMC can be used to instead compute set of samples, with which expectations over the posterior can be approximated. Note that, in this example, SMC is equivalent to using a particle filter.
SMC involves performing inference by maintaining a set of samples, for , which approximate expectations over . At the beginning of inference, these are initialised, e.g. by sampling from a prior before making any observations. Thereafter, they are updated at each time-step to approximate rather than . In this section, we discuss two different procedures for updating the samples, although many more exist. A rough outline of the procedure is as follows:
- Intialise samples (e.g. by sampling from a prior over the initial state)
- for each timestep:
- transition approximate samples of to samples of
- optional: resample to increase effective sample size
Sequential Importance Sampling
For the discussion of sequential importance sampling (SIS), we will consider the case where grows in dimension with , and so can be written as . Liu et al. advocate sampling each from for each , where is the th sample from . This yields samples from the distribution:
However, we want samples which approximate the distribution:
Intuitively, our samples do not condition any of on the new observation, . To correct for this, the samples are weighted. By weighting samples, we can approximate expectations over a distribution, , with weighted samples from . The approximation is accurate in the sense that, for properly chosen weights:
It can be shown that the weights which give this property are given by , so can be used whenever the probability density functions of the true and approximating function are known. If expectations are normalised by the expected value of the weights, the weights can be calculated using unnormalised densities:
Since we want to approximate with samples from , we must weight them by:
We then have a set of weighted samples, for , which approximate . These weights can be used when it is possible to sample exactly from . If we are not able to, we can instead sample from an approximation, , and calculate weights so that the approximation of is still exact (in the limit as ). This leads to weights of the form:
These weights may be kept and multiplied by later weights to approximate the posterior at later timesteps. Alternatively, the posterior can be resampled (as discussed later on), allowing the weights to be discarded.
Rejection sampling is based on a similar idea to SIS but, instead of storing weights, samples are accepted with a probability proportional to the weight. Samples are continually drawn until have been accepted. The procedure is as follows:
- Sample an index, , with probability proportional to (or from a uniform distribution if the samples are not weighted)
- Draw from , which should approximate as well as possible for efficiency
- Accept (, ) with probability where is a constant chosen to ensure that .
This procedure is performed repeatedly until samples have been accepted. Variations of this method exist which can improve efficiency (e.g. the proportion of samples of accepted) with techniques such as Rao-Blackwellisation.
One issue with the sequential importance sampling approach (and other, related approaches), is that the variance of the weights can grow large when they are continually changed over many time steps. Therefore, although we may have samples of , many of them may have a very small weight, and so our approximation could be dominated by very few samples (or even just one). One method to avoid this problem is to resample them from an empirical distribution given by:
where is the Kronecker delta. This is simple to sample from by sampling from a categorical distribution over with probabilities proportional to the weights, and maintains the property that are exact samples from in the limit as . Since this distribution, can be sampled from exactly, the resulting samples do not have weights, and so the resampling step gets rid of the weights, increasing the effective sample size. This procedure effectively gets rid of samples with small weights, and duplicates samples with high weights. The duplicated samples will diverge at the next timestep, so this procedure allows the most "likely" regions of sample space to be explored more. However, the resampling procedure also introduces noise to the procedure, and so it is typically not recommended to perform it at every step. Liu et al. suggest either resampling at fixed intervals (with a size specific to the problem), or checking the effective sample size (a measure of the degeneracy of the weights) after each transition, and resampling when this decreases below a threshold. Also, note that the resampling step is the simplest technique, but other techniques exist which can add less noise.
Sequential Monte Carlo samplers
As described in the previous section, SMC provides an efficient method for sampling from a sequence of related distributions, making it suitable for situations such as performing inference as observations are made. Del Moral et al. suggest using SMC even when only a single distribution is to be inferred. Whereas Liu et al. considered a sequence of (and focused on the case where increases in dimension with ), Del Moral et al. focus on the case where we only wish to infer (which we can evaluate down to a normalising constant). To allow SMC to be used, an artificial sequence of distributions is constructed from . We shall denote this sequence as . Note that is now constant in dimension across all distributions. Several methods are discussed for creating this sequence:
- If is a posterior distribution conditioned on variables, , we can construct a series of distributions by setting to the prior and then conditioning each distribution on one more observation than the previous: for
- In order to sample from an intractable distribution, , we could construct a "smooth" path to it from , some tractable initial distribution, as follows: where
Other sequences are suggested if we wish to find the mode of or estimate the probability of a rare event, rather than sampling from it:
- To converge to the mode of , we can use the sequence where as increases
- If is simple to sample from and we wish to estimate for a rare event, , such that , we can use the sequence where , and the estimate of is obtained form the normalising constant during inference.
As in Liu et al., SMC samplers requires some kernel to transition from a sample from to a sample from . In Liu et al., we discussed kernels consisting of SIS and rejection sampling. However, these were only suitable for the case when the dimension of was increasing. Since the dimension of is constant in the case of SMC samplers, these are no longer suitable. For the discussion of this paper, we denote the transition kernel by , which transitions from , a sample from , to , a sample from . In order to minimise the variance of the weights, the optimal transition would be . However, this is typically not possible and so it is common to use a local move from : Del Moral et al. suggest using either a Markov chain Monte Carlo (MCMC) transition or an approximate Gibbs step. We will now describe these options in more detail.
Markov chain Monte Carlo:
may be set to be an MCMC kernel with as a stationary distribution. This means that, if is a sample from , will also be a sample from (although it will not be independent of ). Since is a sample from instead, will not be a perfect sample from , but will be "close" if is close to (and will be "closer" to than was). The discrepancy is then accounted for in the weights (which are discussed later). A vast literature exists for constructing good MCMC kernels for arbitrary stationary distributions , and so this choice can take advantage of this literature to design an efficient kernel, and so approximate well.
Approximate Gibbs Step
This corresponds to the SIS and rejection sampling methods discussed above, where one variable (e.g. ) is sampled conditioned on all the other variables (e.g. ). This is an exact Gibbs step if is sampled from exactly, or an approximate step is SIS or an MCMC approach is used. Note that, since the dimension of is constant in an SMC sampler, the sampled value will overwrite a previous value, rather than being appended to as when this was discussed above.
Ideally, it would be possible to calculate weights for each as , where is the density function of the "proposal" distribution for given by:
However, the marginalisation over is typically intractable and so it is impossible to calculate the weights in this form. To avoid this marginalisation, we consider the samples as samples of . from , rather than just from . This means that the weights are given by:
where . Since is a marginal of , sampling from the joint distribution will yield samples from . Since it is not always possible to evaulate (for example, the rejection probability of an MCMC kernel is usually intractable), Del Moral et al. propose to express using artificial "backward" kernels , such that:
Computing such that the above equation gives the "true" proposal distribution is still not necessarily tractable. However, Del Moral et al. use the insight that the weights are only required to define an asymptotically correct marginal distribution over , and the distribution over is unimportant. This means that need not be an accurate inverse of the transitions for the sampler to yield asymptotically consistent estimates (although the choice of will affect the variance of the weights). For a given , the weights of particle is updated after step by multiplying by:
where is a (potentially) unnormalised density function of the th distribution, . Depending on the choice of , may cancel out and so this technique can be used even when is intractable.
The algorithm for the SMC sampler is defined as follows:
- Define , sequence of
- for draw
- set and normalise to obtain
- iterate through steps 2 and 4
- If effective sample size less than some threshold, resample and set
- if , stop
- for draw
- set and normalise to obtain
My Thoughts on the Contributions
Del Moral et al. suggest that SMC can be used even when inference is only required over a single distribution, rather than a series of distributions. A very powerful application of this is for inference in well-defined models, in addition to the likes of Metropolis-Hastings and Gibbs sampling, and SMC can sometimes provide significantly improved inference compared to these methods. One potential issue is that it is not necessarily clear how best to select parameters such as the transition kernels, and how to define the sequence of distributions (which the quality of inference is presumably heavily dependent on). Despite this, I believe that the introduction of the framework is a valuable contribution.
- Liu JS, Chen R. Sequential Monte Carlo methods for dynamic systems. Journal of the American statistical association. 1998 Sep 1;93(443):1032-44.
- Del Moral P, Doucet A, Jasra A. Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006 Jun 1;68(3):411-36.
- Hoffman MD, Gelman A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research. 2014 Apr 1;15(1):1593-623.
- Haario H, Saksman E, Tamminen J. An adaptive Metropolis algorithm. Bernoulli. 2001;7(2):223-42.