# Course:CPSC522/Sequential Monte Carlo samplers

## Sequential Monte Carlo samplers

This page first discusses the Sequential Monte Carlo technique[1] for performing "online" Bayesian inference - inference performed continuously as observations are made. We then describe Sequential Monte Carlo samplers[2], which build on this technique to develop an algorithm for efficient "offline" inference.

Principal Author: William Harvey
Collaborators:

## Abstract

Sequential Monte Carlo allows for fast online inference by representing the posterior distribution at each step with a set of samples. Since each distribution in the sequence is typically closely related to the previous distribution, it is possible to use the set of samples from the posterior at time ${\displaystyle t}$ as a starting point for estimating the posterior at time ${\displaystyle t+1}$. This is the key to the Sequential Monte Carlo approach. A more recent development concerns the use of this algorithm for offline inference: when we wish to approximately sample from a single distribution. By defining this distribution as simply the last distribution in a sequence, it is possible to leverage Sequential Monte Carlo as a powerful and general-purpose inference algorithm.

### Builds on

Probability is fundamental as background for Sequential Monte Carlo (SMC). Building on probability, Markov chain Monte Carlo methods are used as an important building block of the SMC algorithm. Also, note that SMC is a generalisation of particle filtering (which considers only state-space models) to any dynamic system.

### Related Pages

As a powerful general method for Bayesian inference, SMC can be used in models such as HMMs.

## Sequential Monte Carlo for Online Inference

Sequential Monte Carlo[1] (SMC) has been developed as a general method for inference in models where information is obtained sequentially. It allows a posterior distribution over unobserved variables to be estimated as the information is obtained, and efficiently updated for each new observation. As with a particle filter, the core idea is to represent the posterior with a set of approximate samples from it. This idea allows the representation of arbitrarily complex and multi-modal posteriors over the states of systems with arbitrarily complex dynamics. It is defined for systems which can be specified as follows:

### Model Definition

The model can be described as a sequence of probability distributions ${\displaystyle \pi _{t}(\mathbf {x} _{t})}$ for ${\displaystyle t=0,1,2,\ldots }$. Each of these refers to the posterior distribution over the latent variables, ${\displaystyle \mathbf {x} _{t}}$, given the observations made up until time ${\displaystyle t}$. Note that the dimension of ${\displaystyle \mathbf {x} _{t}}$ can change with ${\displaystyle t}$ (as in the example below, where it increases with ${\displaystyle t}$ to allow it to store all states ${\displaystyle 1,\ldots ,t}$).

#### State-space example

To give a concrete example of such a model fitting the above definition, we now describe a state-space model. This has a ${\displaystyle d}$-dimensional latent state, ${\displaystyle x_{t}}$, which evolves over time with known stochastic dynamics, ${\displaystyle p(x_{t+1}|x_{t})}$. Observations of the system, ${\displaystyle y_{t}}$, are made according to ${\displaystyle p(y_{t}|x_{t})}$, and used to condition the estimate of ${\displaystyle x}$. ${\displaystyle x}$ is initially distributed according to ${\displaystyle p(x_{0})}$. We define ${\displaystyle \mathbf {x} _{t}}$ to be the set of all ${\displaystyle x_{i}}$ for ${\displaystyle i=0,1,\ldots ,t}$, so that ${\displaystyle \mathbf {x} _{t}}$ essentially contains the "history" of the particle. Note that this is a special case and, in SMC in general, ${\displaystyle \mathbf {x} _{t}}$ does not "remember" all previous states. Similarly, we denote ${\displaystyle \mathbf {y} _{t}}$ to be the set of all ${\displaystyle y_{i}}$ for ${\displaystyle i=0,1,\ldots ,t}$. Note that this means that the dimension of ${\displaystyle \mathbf {x} _{t}}$ increases by ${\displaystyle d}$ with each increase in ${\displaystyle t}$. This gives the following expression for the joint distribution of ${\displaystyle \mathbf {x} _{t}}$ and ${\displaystyle \mathbf {y} _{t}}$:

${\displaystyle p(\mathbf {x} _{t},\mathbf {y} _{t})=p(x_{0})\prod _{i=1}^{t}p(y_{i}|x_{i})p(x_{i}|x_{i-1})}$

Since the posterior ${\displaystyle \pi _{t}(\mathbf {x} _{t})}$ is proportional to the joint probability (up to a normalising constant, ${\displaystyle Z}$, independent of ${\displaystyle \mathbf {x} _{t}}$), we can write it as:

${\displaystyle \pi _{t}(\mathbf {x} _{t})={\frac {1}{Z}}p(x_{0})\prod _{i=1}^{t}p(y_{i}|x_{i})p(x_{i}|x_{i-1})}$

In the case that each distribution is Gaussian, with variance independent of ${\displaystyle x_{t}}$, and the means of ${\displaystyle p(y_{t}|x_{t})}$ and ${\displaystyle p(x_{t+1}|x_{t})}$ are affine transformations of ${\displaystyle x_{t}}$, it is possible to calculate this posterior analytically with a Kalman filter. When these conditions are not met, SMC can instead be used to compute a set of samples, with which expectations over the posterior can be approximated. Note that, in this example, SMC is equivalent to using a particle filter.
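To make the state-space definition concrete, the following sketch simulates a 1-D instance of such a model. The parameters (state persistence 0.9, transition variance 0.5, observation variance 1.0) are illustrative assumptions, not taken from the text.

```python
import math
import random

random.seed(0)

# A minimal 1-D state-space model (illustrative parameters):
# x_0 ~ N(0, 1), x_t = a * x_{t-1} + N(0, q), y_t = x_t + N(0, r)
a, q, r = 0.9, 0.5, 1.0

def simulate(T):
    """Draw a latent trajectory x_{0:T} and observations y_{1:T}."""
    x = [random.gauss(0.0, 1.0)]
    y = []
    for _ in range(T):
        x.append(a * x[-1] + random.gauss(0.0, math.sqrt(q)))
        y.append(x[-1] + random.gauss(0.0, math.sqrt(r)))
    return x, y

xs, ys = simulate(50)
```

Since both conditionals are Gaussian with affine means, this particular instance could also be filtered exactly with a Kalman filter, making it a useful test case for SMC code.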

### Inference

SMC involves performing inference by maintaining a set of ${\displaystyle m}$ samples, ${\displaystyle \mathbf {x} _{t}^{(j)}}$ for ${\displaystyle j=1,\ldots ,m}$, which are used to approximate expectations over ${\displaystyle \pi _{t}(\mathbf {x} _{t})}$. At the beginning of inference, these are initialised, e.g. by sampling from a prior before making any observations. Thereafter, they are updated at each time-step to approximate ${\displaystyle \pi _{t+1}(\mathbf {x} _{t+1})}$ rather than ${\displaystyle \pi _{t}(\mathbf {x} _{t})}$. In this section, we discuss two of the many possible procedures for updating the samples. A rough outline of the procedure is as follows:

1. Initialise samples (e.g. by sampling from a prior over the initial state)
2. For each timestep:
   1. transition approximate samples of ${\displaystyle \pi _{t-1}(\mathbf {x} _{t-1})}$ to approximate samples of ${\displaystyle \pi _{t}(\mathbf {x} _{t})}$
   2. optional: resample the particles to increase the effective sample size

#### Sequential Importance Sampling

For the discussion of sequential importance sampling (SIS), we will consider the case where ${\displaystyle \mathbf {x} _{t}}$ grows in dimension with ${\displaystyle t}$, and so ${\displaystyle \mathbf {x} _{t+1}}$ can be written as ${\displaystyle \{x_{t+1},\mathbf {x} _{t}\}}$. Liu et al.[1] advocate sampling each ${\displaystyle x_{t+1}^{(j)}}$ from ${\displaystyle p(x_{t+1}|\mathbf {y} _{t+1},\mathbf {x} _{t}^{(j)})}$ for each ${\displaystyle j}$, where ${\displaystyle \mathbf {x} _{t}^{(j)}}$ is the ${\displaystyle j}$th sample from ${\displaystyle \pi _{t}(\mathbf {x} _{t})}$. This yields samples ${\displaystyle \mathbf {x} _{t+1}^{(j)}=\{x_{t+1}^{(j)},\mathbf {x} _{t}^{(j)}\}}$ from the distribution:

${\displaystyle \pi _{t}(\mathbf {x} _{t})p(x_{t+1}|\mathbf {y} _{t+1},\mathbf {x} _{t})}$

However, we want samples which approximate the distribution:

${\displaystyle \pi _{t+1}(\mathbf {x} _{t+1})=\pi _{t+1}(\mathbf {x} _{t})p(x_{t+1}|\mathbf {y} _{t+1},\mathbf {x} _{t})}$

Intuitively, our samples do not condition any of ${\displaystyle \mathbf {x} _{t}^{(j)}}$ on the new observation, ${\displaystyle y_{t+1}}$. To correct for this, the samples are weighted. By weighting samples, we can approximate expectations over a distribution, ${\displaystyle p(x)}$, with weighted samples from ${\displaystyle q(x)}$. The approximation is accurate in the sense that, for properly chosen weights:

${\displaystyle \mathbb {E} _{p(x)}\left[f(x)\right]=\mathbb {E} _{q(x)}\left[w(x)f(x)\right]}$

It can be shown that the weights which give this property are ${\displaystyle w(x)={\frac {p(x)}{q(x)}}}$, so this approach can be used whenever the probability density functions of the target and proposal distributions are known. If expectations are normalised by the expected value of the weights, the weights can be calculated using unnormalised densities:

${\displaystyle \mathbb {E} _{p(x)}\left[f(x)\right]={\frac {\mathbb {E} _{q(x)}\left[w(x)f(x)\right]}{\mathbb {E} \left[w(x)\right]}}}$
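To make this identity concrete, here is a minimal self-normalised importance sampling sketch. The target p = N(1, 1), proposal q = N(0, 2²) and test function f(x) = x are illustrative assumptions; the true expectation is 1.

```python
import math
import random

random.seed(1)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Estimate E_p[x] = 1 using the self-normalised estimator
# sum(w * f) / sum(w), with weights w = p/q on samples from q.
n = 100_000
samples = [random.gauss(0.0, 2.0) for _ in range(n)]
ws = [normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 2.0) for x in samples]
estimate = sum(w * x for w, x in zip(ws, samples)) / sum(ws)
```

Because the proposal here is heavier-tailed than the target, the weight function is bounded, which keeps the variance of the estimator modest.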

Since we want to approximate ${\displaystyle \pi _{t+1}(\mathbf {x} _{t+1})}$ with samples from ${\displaystyle \pi _{t}(\mathbf {x} _{t})p(x_{t+1}|\mathbf {y} _{t+1},\mathbf {x} _{t})}$, we must weight them by:

${\displaystyle w(\mathbf {x} _{t+1})={\frac {\pi _{t+1}(\mathbf {x} _{t+1})}{\pi _{t}(\mathbf {x} _{t})p(x_{t+1}|\mathbf {y} _{t+1},\mathbf {x} _{t})}}={\frac {\pi _{t+1}(\mathbf {x} _{t})p(x_{t+1}|\mathbf {y} _{t+1},\mathbf {x} _{t})}{\pi _{t}(\mathbf {x} _{t})p(x_{t+1}|\mathbf {y} _{t+1},\mathbf {x} _{t})}}={\frac {\pi _{t+1}(\mathbf {x} _{t})}{\pi _{t}(\mathbf {x} _{t})}}}$

We then have a set of weighted samples, ${\displaystyle \{\mathbf {x} _{t+1}^{(j)},w_{t+1}^{(j)}\}}$ for ${\displaystyle j=0,\ldots ,m}$, which approximate ${\displaystyle \pi _{t+1}(\mathbf {x} _{t+1})}$. These weights can be used when it is possible to sample exactly from ${\displaystyle p(x_{t+1}|\mathbf {y} _{t+1},\mathbf {x} _{t}^{(j)})}$. If we are not able to, we can instead sample from an approximation, ${\displaystyle q(x_{t+1}|\mathbf {y} _{t+1},\mathbf {x} _{t}^{(j)})}$, and calculate weights so that the approximation of ${\displaystyle \pi _{t+1}(\mathbf {x} _{t+1})}$ is still exact (in the limit as ${\displaystyle m\rightarrow \infty }$). This leads to weights of the form:

${\displaystyle w(\mathbf {x} _{t+1})={\frac {\pi _{t+1}(\mathbf {x} _{t})p(x_{t+1}|\mathbf {y} _{t+1},\mathbf {x} _{t})}{\pi _{t}(\mathbf {x} _{t})q(x_{t+1}|\mathbf {y} _{t+1},\mathbf {x} _{t})}}}$

These weights may be kept and multiplied by later weights to approximate the posterior at later timesteps. Alternatively, the posterior can be resampled (as discussed later on), allowing the weights to be discarded.
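A single SIS step can be sketched as follows for the 1-D Gaussian state-space model used as an example earlier (all parameters are illustrative assumptions). With the "bootstrap" proposal ${\displaystyle q=p(x_{t+1}|x_{t})}$, the incremental weight reduces to the observation likelihood.

```python
import math
import random

random.seed(2)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Assumed 1-D model: x_{t+1} | x_t ~ N(0.9 x_t, 0.5), y_t | x_t ~ N(x_t, 1).
# With proposal q = p(x_{t+1} | x_t), each particle's weight is
# multiplied by the likelihood p(y_{t+1} | x_{t+1}).
def sis_step(particles, weights, y_next):
    new_particles, new_weights = [], []
    for x, w in zip(particles, weights):
        x_next = random.gauss(0.9 * x, math.sqrt(0.5))   # sample from q
        new_particles.append(x_next)
        new_weights.append(w * normal_pdf(y_next, x_next, 1.0))
    total = sum(new_weights)                             # normalise weights
    return new_particles, [w / total for w in new_weights]

m = 500
particles = [random.gauss(0.0, 1.0) for _ in range(m)]   # samples from the prior
weights = [1.0 / m] * m
particles, weights = sis_step(particles, weights, y_next=0.5)
```

In practice one would append each `x_next` to a stored history to represent the full ${\displaystyle \mathbf {x} _{t+1}^{(j)}}$; only the latest state is kept here for brevity.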

#### Rejection Sampling

Rejection sampling is based on a similar idea to SIS but, instead of storing weights, samples are accepted with a probability proportional to the weight. Samples are continually drawn until ${\displaystyle m}$ have been accepted. The procedure is as follows:

1. Sample an index, ${\displaystyle j}$, with probability proportional to ${\displaystyle w_{t}^{(j)}}$ (or from a uniform distribution if the samples are not weighted)
2. Draw ${\displaystyle x_{t+1}^{(j)}}$ from ${\displaystyle g(x_{t+1}|\mathbf {x} _{t}^{(j)},\mathbf {y} _{t+1})}$, which should approximate ${\displaystyle \pi _{t+1}(x_{t+1}|\mathbf {x} _{t}^{(j)})}$ as well as possible for efficiency
3. Accept (${\displaystyle j}$, ${\displaystyle x_{t+1}^{(j)}}$) with probability ${\displaystyle p={\frac {\pi _{t+1}(\mathbf {x} _{t}^{(j)},x_{t+1})}{c_{t+1}\pi _{t}(\mathbf {x} _{t}^{(j)})g(x_{t+1}|\mathbf {x} _{t},\mathbf {y} _{t+1})}}}$ where ${\displaystyle c_{t+1}}$ is a constant chosen to ensure that ${\displaystyle p\leq 1}$.

This procedure is performed repeatedly until ${\displaystyle m}$ samples have been accepted. Variations of this method exist which can improve efficiency (e.g. the proportion of samples accepted) with techniques such as Rao-Blackwellisation.
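The accept/reject idea in steps 2-3 can be sketched generically as follows. The target ${\displaystyle \pi ={\mathcal {N}}(0,1)}$, proposal ${\displaystyle g={\mathcal {N}}(0,2^{2})}$, and the constant ${\displaystyle c=2}$ (which bounds the density ratio, attained at ${\displaystyle x=0}$) are illustrative assumptions.

```python
import math
import random

random.seed(3)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# c is chosen so that pi(x) / (c * g(x)) <= 1 for all x.
c = 2.0

def rejection_sample(m):
    """Draw from g repeatedly, accepting with probability pi/(c*g),
    until m samples have been accepted."""
    accepted = []
    while len(accepted) < m:
        x = random.gauss(0.0, 2.0)  # draw from the proposal g
        p_accept = normal_pdf(x, 0.0, 1.0) / (c * normal_pdf(x, 0.0, 2.0))
        if random.random() < p_accept:
            accepted.append(x)
    return accepted

samples = rejection_sample(2000)
mean = sum(samples) / len(samples)
```

Unlike SIS, the accepted samples carry no weights, at the cost of discarding a fraction of the proposals.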

#### Resampling

One issue with the sequential importance sampling approach (and other, related approaches) is that the variance of the weights can grow large when they are continually updated over many time steps. Therefore, although we may have ${\displaystyle m}$ samples of ${\displaystyle \pi _{t}(\mathbf {x} _{t})}$, many of them may have a very small weight, and so our approximation could be dominated by very few samples (or even just one). One method to avoid this problem is to resample them from an empirical distribution given by:

${\displaystyle {\hat {\pi }}_{t}(\mathbf {x} _{t})=\sum _{j=1}^{m}{\frac {w_{t}^{(j)}}{\sum _{j'=1}^{m}w_{t}^{(j')}}}\delta _{\mathbf {x} _{t}^{(j)}}(\mathbf {x} _{t})}$

where ${\displaystyle \delta }$ is the Dirac delta. This is simple to sample from: draw from a categorical distribution over ${\displaystyle j}$ with probabilities proportional to the weights. It maintains the property that the ${\displaystyle \mathbf {x} _{t}^{(j)}}$ are exact samples from ${\displaystyle \pi _{t}(\mathbf {x} _{t})}$ in the limit as ${\displaystyle m\rightarrow \infty }$. Since this distribution can be sampled from exactly, the resulting samples carry no weights, and so the resampling step discards the weights and increases the effective sample size. In effect, it removes samples with small weights and duplicates samples with high weights. The duplicated samples will diverge at the next timestep, so this procedure allows the most "likely" regions of the sample space to be explored more thoroughly. However, resampling also introduces noise, and so it is typically not recommended at every step. Liu et al.[1] suggest either resampling at fixed intervals (with a period specific to the problem), or checking the effective sample size (a measure of the degeneracy of the weights) after each transition and resampling when it falls below a threshold. Note also that the multinomial resampling described here is the simplest technique; other techniques exist which introduce less noise.
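The effective sample size and multinomial resampling can be sketched as below. The common estimator ${\displaystyle \mathrm {ESS} =1/\sum _{j}(W^{(j)})^{2}}$ for normalised weights ${\displaystyle W^{(j)}}$ is used; the toy weight values are illustrative.

```python
import random

random.seed(4)

def effective_sample_size(weights):
    """ESS = 1 / sum(W_j^2) for normalised weights W_j; equals m when
    weights are uniform and approaches 1 when one weight dominates."""
    total = sum(weights)
    return 1.0 / sum((w / total) ** 2 for w in weights)

def resample(particles, weights):
    """Multinomial resampling: draw m indices with probability proportional
    to the weights; the resampled particles are unweighted."""
    m = len(particles)
    chosen = random.choices(particles, weights=weights, k=m)
    return chosen, [1.0 / m] * m

# Degenerate weights: one particle dominates the approximation.
particles = [0.0, 1.0, 2.0, 3.0]
weights = [0.97, 0.01, 0.01, 0.01]
ess = effective_sample_size(weights)       # close to 1: weights are degenerate
particles, weights = resample(particles, weights)
```

In an SMC loop one would call `resample` only when `ess` drops below a threshold such as `m / 2`.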

## Sequential Monte Carlo samplers

As described in the previous section, SMC provides an efficient method for sampling from a sequence of related distributions, making it suitable for situations such as performing inference as observations are made. Del Moral et al.[2] suggest using SMC even when only a single distribution is to be inferred. Whereas Liu et al.[1] considered a sequence of ${\displaystyle \pi _{t}(\mathbf {x} _{t})}$ (and focused on the case where ${\displaystyle \mathbf {x} _{t}}$ increases in dimension with ${\displaystyle t}$), Del Moral et al.[2] focus on the case where we only wish to infer ${\displaystyle \pi (\mathbf {x} )}$ (which we can evaluate up to a normalising constant). To allow SMC to be used, an artificial sequence of ${\displaystyle p}$ distributions is constructed from ${\displaystyle \pi (\mathbf {x} )}$. We shall denote this sequence as ${\displaystyle \pi _{1}(\mathbf {x} ),\ldots ,\pi _{p}(\mathbf {x} )}$. Note that ${\displaystyle \mathbf {x} }$ is now constant in dimension across all distributions. Several methods are discussed for creating this sequence:

1. If ${\displaystyle \pi (\mathbf {x} )}$ is a posterior distribution conditioned on ${\displaystyle p-1}$ observations, ${\displaystyle y_{1,\ldots ,p-1}}$, we can construct a series of distributions by setting ${\displaystyle \pi _{1}(\mathbf {x} )}$ to the prior and then conditioning each distribution on one more observation than the previous: ${\displaystyle \pi _{n+1}(\mathbf {x} )=p(\mathbf {x} |y_{1,\ldots ,n})}$ for ${\displaystyle n=1,\ldots ,p-1}$
2. In order to sample from an intractable distribution, ${\displaystyle \pi (\mathbf {x} )}$, we could construct a "smooth" path to it from ${\displaystyle \mu (\mathbf {x} )}$, some tractable initial distribution, as follows: ${\displaystyle \pi _{n}(\mathbf {x} )\propto \mu (\mathbf {x} )^{1-\phi _{n}}\pi (\mathbf {x} )^{\phi _{n}}}$, where ${\displaystyle 0\leq \phi _{1}\leq \ldots \leq \phi _{p}=1}$

Other sequences are suggested if we wish to find the mode of ${\displaystyle \pi (\mathbf {x} )}$ or estimate the probability of a rare event, rather than sampling from it:

1. To converge to the mode of ${\displaystyle \pi (\mathbf {x} )}$, we can use the sequence ${\displaystyle \pi _{n}(\mathbf {x} )\propto \pi (\mathbf {x} )^{\phi _{n}}}$ where ${\displaystyle \phi _{n}\rightarrow \infty }$ as ${\displaystyle n}$ increases
2. If ${\displaystyle \pi (\mathbf {x} )}$ is simple to sample from and we wish to estimate ${\displaystyle \pi (A)=\mathbb {P} (\mathbf {x} \in A)}$ for a rare event, ${\displaystyle A}$, such that ${\displaystyle \pi (A)\approx 0}$, we can use the sequence ${\displaystyle \pi _{n}(\mathbf {x} )\propto \pi (\mathbf {x} )\mathbf {1} _{A_{n}}(\mathbf {x} )}$ where ${\displaystyle A_{1}\supset \ldots \supset A_{p}=A}$, and the estimate of ${\displaystyle \pi (A)}$ is obtained from the normalising constant during inference.
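The "smooth path" construction can be sketched in log space as follows. The initial distribution ${\displaystyle \mu ={\mathcal {N}}(0,10^{2})}$ and the unnormalised bimodal target ${\displaystyle \pi (x)\propto \exp(-(x^{2}-4)^{2}/4)}$ are illustrative assumptions.

```python
# Tempered path pi_n(x) ∝ mu(x)^(1 - phi_n) * pi(x)^(phi_n), in log space.
def log_mu(x):
    """Unnormalised log-density of the tractable initial distribution."""
    return -0.5 * (x / 10.0) ** 2

def log_pi(x):
    """Unnormalised log-density of the (assumed, bimodal) target."""
    return -((x ** 2 - 4.0) ** 2) / 4.0

def log_pi_n(x, phi):
    """Unnormalised log-density of the n-th bridging distribution."""
    return (1.0 - phi) * log_mu(x) + phi * log_pi(x)

# phi = 0 recovers mu; phi = 1 recovers the target.
phis = [0.0, 0.25, 0.5, 0.75, 1.0]
```

Working with unnormalised log-densities is all that is needed, since the SMC sampler weights only ever involve ratios of these quantities.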

### Transition Kernels

As in Liu et al.[1], SMC samplers require a kernel to transition from a sample from ${\displaystyle \pi _{n}(\mathbf {x} )}$ to a sample from ${\displaystyle \pi _{n+1}(\mathbf {x} )}$. Above, in the context of Liu et al.[1], we discussed kernels consisting of SIS and rejection sampling. However, these were only suitable for the case where the dimension of ${\displaystyle \mathbf {x} }$ was increasing. Since the dimension of ${\displaystyle \mathbf {x} }$ is constant in the case of SMC samplers, they are no longer suitable. For the discussion of this paper, we denote the transition kernel by ${\displaystyle K_{n}(\mathbf {x} ,\mathbf {x} ')}$, which transitions from ${\displaystyle \mathbf {x} }$, a sample from ${\displaystyle \pi _{n-1}}$, to ${\displaystyle \mathbf {x'} }$, a sample from ${\displaystyle \pi _{n}}$. To minimise the variance of the weights, the optimal transition would be ${\displaystyle K_{n}(\mathbf {x} ,\mathbf {x} ')=\pi _{n}(\mathbf {x} ')}$, i.e. sampling directly from the new target. However, this is typically not possible, and so it is common to use a local move from ${\displaystyle \mathbf {x} }$: Del Moral et al.[2] suggest using either a Markov chain Monte Carlo (MCMC) transition or an approximate Gibbs step. We now describe these options in more detail.

#### Markov chain Monte Carlo

${\displaystyle K_{n}(\mathbf {x} ,\mathbf {x} ')}$ may be set to be an MCMC kernel with ${\displaystyle \pi _{n}(\mathbf {x} ')}$ as a stationary distribution. This means that, if ${\displaystyle \mathbf {x} }$ is a sample from ${\displaystyle \pi _{n}(\mathbf {x} )}$, ${\displaystyle \mathbf {x} '}$ will also be a sample from ${\displaystyle \pi _{n}(\mathbf {x} )}$ (although it will not be independent of ${\displaystyle \mathbf {x} }$). Since ${\displaystyle \mathbf {x} }$ is a sample from ${\displaystyle \pi _{n-1}(\mathbf {x} )}$ instead, ${\displaystyle \mathbf {x} '}$ will not be a perfect sample from ${\displaystyle \pi _{n}(\mathbf {x} )}$, but will be "close" if ${\displaystyle \pi _{n}}$ is close to ${\displaystyle \pi _{n-1}}$ (and ${\displaystyle \mathbf {x} '}$ will be "closer" to ${\displaystyle \pi _{n}}$ than ${\displaystyle \mathbf {x} }$ was). The discrepancy is then accounted for in the weights (which are discussed later). A vast literature exists on constructing good MCMC kernels for arbitrary stationary distributions[3][4], which this choice can take advantage of to design an efficient kernel and thus approximate ${\displaystyle \pi _{n}(\mathbf {x} )}$ well.
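A simple instance of such a kernel is a random-walk Metropolis step, which leaves ${\displaystyle \pi _{n}}$ invariant while requiring only an unnormalised log-density. The step size and the example target ${\displaystyle \pi _{n}={\mathcal {N}}(0,1)}$ are illustrative assumptions.

```python
import math
import random

random.seed(5)

def mcmc_kernel(x, log_gamma_n, step=0.5):
    """One random-walk Metropolis step with stationary distribution pi_n,
    specified via its unnormalised log-density log_gamma_n."""
    proposal = x + random.gauss(0.0, step)
    log_accept = log_gamma_n(proposal) - log_gamma_n(x)
    if math.log(random.random()) < log_accept:
        return proposal   # accepted move
    return x              # rejected: stay at x

# Example: pi_n = N(0, 1), so log gamma_n(x) = -x^2 / 2 up to a constant.
log_gamma = lambda x: -0.5 * x ** 2
x = 3.0
for _ in range(1000):
    x = mcmc_kernel(x, log_gamma)
```

Note that the density of this kernel (including its accept/reject step) is intractable to evaluate, which is exactly why the backward-kernel construction below is needed to compute weights.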

#### Approximate Gibbs Step

This corresponds to the SIS and rejection sampling methods discussed above, where one variable is sampled conditioned on all the other variables. It is an exact Gibbs step if the new value is sampled exactly from its conditional distribution under ${\displaystyle \pi _{n}}$, or an approximate step if SIS or an MCMC approach is used. Note that, since the dimension of ${\displaystyle \mathbf {x} }$ is constant in an SMC sampler, the sampled value overwrites a previous value, rather than being appended to the state as in the online setting discussed above.

### Weights

Ideally, it would be possible to calculate weights for each ${\displaystyle \mathbf {x} _{n}^{(j)}}$ as ${\displaystyle {\frac {\pi _{n}(\mathbf {x} _{n}^{(j)})}{\eta _{n}(\mathbf {x} _{n}^{(j)})}}}$, where ${\displaystyle \eta _{n}(\mathbf {x} _{n}^{(j)})}$ is the density function of the "proposal" distribution for ${\displaystyle \mathbf {x} _{n}^{(j)}}$ given by:

${\displaystyle \eta _{n}(\mathbf {x} _{n}^{(j)})=\int \eta _{1}(\mathbf {x} _{1}^{(j)})\prod _{n'=2}^{n}K_{n'}(\mathbf {x} _{n'-1}^{(j)},\mathbf {x} _{n'}^{(j)})\,d\mathbf {x} _{1}\ldots d\mathbf {x} _{n-1}}$.

However, the marginalisation over ${\displaystyle \mathbf {x} _{1},\ldots ,\mathbf {x} _{n-1}}$ is typically intractable and so it is impossible to calculate the weights in this form. To avoid this marginalisation, we consider the samples as samples of ${\displaystyle \mathbf {x} _{1,\ldots ,n}}$ from ${\displaystyle \pi _{1,\ldots ,n}}$, rather than just ${\displaystyle \mathbf {x} _{n}}$ from ${\displaystyle \pi _{n}}$. This means that the weights are given by:

${\displaystyle w_{1:n}^{(j)}={\frac {\pi _{1,\ldots ,n}(\mathbf {x} _{1,\ldots ,n}^{(j)})}{\eta _{1,\ldots ,n}(\mathbf {x} _{1,\ldots ,n}^{(j)})}}}$,

where ${\displaystyle \eta _{1,\ldots ,n}(\mathbf {x} _{1,\ldots ,n}^{(j)})=\eta _{1}(\mathbf {x} _{1}^{(j)})\prod _{n'=2}^{n}K_{n'}(\mathbf {x} _{n'-1}^{(j)},\mathbf {x} _{n'}^{(j)})}$ is the joint proposal density. Since ${\displaystyle \pi _{n}}$ is a marginal of ${\displaystyle \pi _{1,\ldots ,n}}$, sampling from the joint distribution will yield samples from ${\displaystyle \pi _{n}}$. However, the true extended target ${\displaystyle \pi _{1,\ldots ,n}}$ is generally unavailable, and it is not always possible to evaluate ${\displaystyle K_{n}(\mathbf {x} _{n-1},\mathbf {x} _{n})}$ either (for example, the density of an MCMC kernel with an accept/reject step is usually intractable). Del Moral et al.[2] therefore propose to define the extended target ${\displaystyle \pi _{1,\ldots ,n}}$ using artificial "backward" kernels ${\displaystyle L_{n}}$, such that:

${\displaystyle \pi _{1,\ldots ,n}(\mathbf {x} _{1,\ldots ,n}^{(j)})=\pi _{n}(\mathbf {x} _{n}^{(j)})\prod _{n'=2}^{n}L_{n'-1}(\mathbf {x} _{n'-1}^{(j)},\mathbf {x} _{n'}^{(j)})}$

Choosing ${\displaystyle L_{n}}$ so that this extended target exactly recovers the marginal weights ${\displaystyle \pi _{n}/\eta _{n}}$ is still not necessarily tractable. However, Del Moral et al.[2] use the insight that the weights are only required to define an asymptotically correct marginal distribution over ${\displaystyle \mathbf {x} _{n}}$; the distribution over ${\displaystyle \mathbf {x} _{1:n-1}}$ is unimportant. This means that ${\displaystyle L_{n}}$ need not be an accurate inverse of the forward transitions for the sampler to yield asymptotically consistent estimates (although the choice of ${\displaystyle L_{n}}$ will affect the variance of the weights). For a given ${\displaystyle L_{n}}$, the weight of particle ${\displaystyle j}$ is updated after step ${\displaystyle n}$ by multiplying by:

${\displaystyle {\tilde {w}}(\mathbf {x} _{n-1}^{(j)},\mathbf {x} _{n}^{(j)})={\frac {\gamma _{n}(\mathbf {x} _{n}^{(j)})L_{n-1}(\mathbf {x} _{n-1}^{(j)},\mathbf {x} _{n}^{(j)})}{\gamma _{n-1}(\mathbf {x} _{n-1}^{(j)})K_{n}(\mathbf {x} _{n-1}^{(j)},\mathbf {x} _{n}^{(j)})}}}$

where ${\displaystyle \gamma _{i}(\mathbf {x} )}$ is a (potentially) unnormalised density function of the ${\displaystyle i}$th distribution, ${\displaystyle \pi _{i}(\mathbf {x} )}$. Depending on the choice of ${\displaystyle L_{n-1}(\mathbf {x} _{n-1},\mathbf {x} _{n})}$, ${\displaystyle K_{n}(\mathbf {x} _{n-1},\mathbf {x} _{n})}$ may cancel out, and so this technique can be used even when ${\displaystyle K_{n}(\mathbf {x} _{n-1},\mathbf {x} _{n})}$ is intractable. In particular, when ${\displaystyle K_{n}}$ is an MCMC kernel with invariant distribution ${\displaystyle \pi _{n}}$ and ${\displaystyle L_{n-1}}$ is chosen as its time-reversal, the incremental weight simplifies to ${\displaystyle {\frac {\gamma _{n}(\mathbf {x} _{n-1}^{(j)})}{\gamma _{n-1}(\mathbf {x} _{n-1}^{(j)})}}}$, which requires no evaluation of the kernel at all.

### Algorithm

The algorithm for the SMC sampler is defined as follows:

1. Define the backward kernels ${\displaystyle L_{n-1}(x_{n-1},x_{n})}$, the forward kernels ${\displaystyle K_{n}(x_{n-1},x_{n})}$, and the sequence of distributions ${\displaystyle \pi _{1}(\mathbf {x} ),\ldots ,\pi _{p}(\mathbf {x} )}$
2. Initialisation:
• set ${\displaystyle n=1}$
• for ${\displaystyle j=1,\ldots ,m}$ draw ${\displaystyle \mathbf {x} _{1}^{(j)}\sim \eta _{1}}$
• set ${\displaystyle \{w_{1}^{(j)}\}=\{{\frac {\pi _{1}(\mathbf {x} _{1}^{(j)})}{\eta _{1}(\mathbf {x} _{1}^{(j)})}}\}}$ and normalise to obtain ${\displaystyle \{W_{1}^{(j)}\}}$
• iterate through steps 3 and 4
3. Resampling:
• if the effective sample size is less than some threshold, resample and set ${\displaystyle W_{n}^{(j)}={\frac {1}{m}}}$
4. Sampling:
• set ${\displaystyle n=n+1}$
• if ${\displaystyle n>p}$, stop
• for ${\displaystyle j=1,\ldots ,m}$ draw ${\displaystyle \mathbf {x} _{n}^{(j)}\sim K_{n}(\mathbf {x} _{n-1}^{(j)},.)}$
• set ${\displaystyle \{w_{n}^{(j)}\}=\{W_{n-1}^{(j)}{\frac {\gamma _{n}(\mathbf {x} _{n}^{(j)})L_{n-1}(\mathbf {x} _{n-1}^{(j)},\mathbf {x} _{n}^{(j)})}{\gamma _{n-1}(\mathbf {x} _{n-1}^{(j)})K_{n}(\mathbf {x} _{n-1}^{(j)},\mathbf {x} _{n}^{(j)})}}\}}$ and normalise to obtain ${\displaystyle \{W_{n}^{(j)}\}}$
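As a concrete illustration, the algorithm can be sketched end-to-end for a tempered path between two Gaussians. All densities, kernels, and parameters here are illustrative assumptions; ${\displaystyle K_{n}}$ is a ${\displaystyle \pi _{n}}$-invariant Metropolis kernel with ${\displaystyle L_{n-1}}$ taken as its time-reversal, for which the incremental weight reduces to ${\displaystyle \gamma _{n}(\mathbf {x} _{n-1})/\gamma _{n-1}(\mathbf {x} _{n-1})}$.

```python
import math
import random

random.seed(6)

# Tempered path from mu = N(0, 3^2) to pi = N(2, 0.5^2):
# gamma_n(x) = mu(x)^(1 - phi_n) * pi(x)^(phi_n), unnormalised, in log space.
def log_gamma(x, phi):
    log_mu = -0.5 * (x / 3.0) ** 2
    log_pi = -0.5 * ((x - 2.0) / 0.5) ** 2
    return (1.0 - phi) * log_mu + phi * log_pi

def metropolis(x, phi, step=0.5):
    """One Metropolis move leaving pi_n (with temperature phi) invariant."""
    prop = x + random.gauss(0.0, step)
    if math.log(random.random()) < log_gamma(prop, phi) - log_gamma(x, phi):
        return prop
    return x

def resample(xs, ws):
    m = len(xs)
    return random.choices(xs, weights=ws, k=m), [1.0 / m] * m

m, p = 1000, 20
phis = [n / (p - 1) for n in range(p)]            # phi_1 = 0, ..., phi_p = 1
xs = [random.gauss(0.0, 3.0) for _ in range(m)]   # draw from eta_1 = mu = pi_1
ws = [1.0 / m] * m                                # so initial weights are uniform
for n in range(1, p):
    # reweight by the incremental weight gamma_n(x) / gamma_{n-1}(x)
    ws = [w * math.exp(log_gamma(x, phis[n]) - log_gamma(x, phis[n - 1]))
          for w, x in zip(ws, xs)]
    total = sum(ws)
    ws = [w / total for w in ws]
    if 1.0 / sum(w ** 2 for w in ws) < m / 2:     # ESS below threshold?
        xs, ws = resample(xs, ws)
    xs = [metropolis(x, phis[n]) for x in xs]     # move each particle with K_n
posterior_mean = sum(w * x for w, x in zip(ws, xs))
```

The final weighted particles approximate the target, so `posterior_mean` should land near the target mean of 2; in a real application one would typically also use several MCMC moves per temperature and an adaptively chosen temperature schedule.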

## My Thoughts on the Contributions

Del Moral et al.[2] suggest that SMC can be used even when inference is only required over a single distribution, rather than a series of distributions. This makes SMC a powerful general-purpose inference algorithm for well-defined models, alongside the likes of Metropolis-Hastings and Gibbs sampling, and it can sometimes provide significantly improved inference compared to these methods. One potential issue is that it is not necessarily clear how best to select parameters such as the transition kernels, or how to define the sequence of distributions (on which the quality of inference is presumably heavily dependent). Despite this, I believe that the introduction of the framework is a valuable contribution.

## Annotated Bibliography

1. Liu JS, Chen R. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association. 1998 Sep 1;93(443):1032-44.
2. Del Moral P, Doucet A, Jasra A. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006 Jun 1;68(3):411-36.
3. Hoffman MD, Gelman A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research. 2014 Apr 1;15(1):1593-623.
4. Haario H, Saksman E, Tamminen J. An adaptive Metropolis algorithm. Bernoulli. 2001;7(2):223-42.

 Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0