# Course:CPSC522/Bayesian Coresets

## Bayesian Coresets

In order to decouple increasing data sizes from the run-time complexity of Bayesian posterior inference methods like MCMC, Bayesian coresets adaptively subsample and weight observations, thereby constructing a generic approximate sufficient statistic.

Principal Author: Boyan Beronov

Collaborators:

## Abstract

As data sizes grow toward the infinite limit, the cost of Bayesian inference becomes prohibitive for algorithms that do not exhibit approximately linear run-time scaling and constant memory cost. This is a major reason why, despite their inherent bias, variational inference methods have become increasingly popular relative to the well-established but slow MCMC methods, which are at least asymptotically unbiased, i.e., converge to the exact posterior in the limit of infinite run-time. Bayesian coresets provide a practical way of scaling up the latter: they first summarize the statistical information present in the observations – i.e., approximate the full likelihood function at a controllable trade-off between accuracy and memory cost – and then run expensive inference algorithms on the compressed data set. Under the sole assumptions of conditional independence and tractable likelihoods, recent work has developed linear-cost algorithms for coreset construction with rigorous convergence bounds, which in particular are independent of the internals of the generative model. As a new, generic means of constructing approximate sufficient statistics, these methods might come to have significant impact on many problems of interest in probabilistic machine learning and artificial intelligence.

### Builds on

Bayesian coresets arise when the idea of coresets, which originated in the context of approximation algorithms in computational geometry, is applied to Bayesian inference. Once constructed, coresets can provide significant speedups to approximate Bayesian inference, e.g., for algorithms in the MCMC and Variational Inference families. Their construction is an interesting computational problem in its own right, and is closely related to Monte Carlo methods and the sparse approximation problem.

### Related Pages

Any expensive density estimation problem which tolerates approximate inference algorithms is a potential application of Bayesian coresets. This includes: Graphical Models and Bayesian Networks, Learning Probabilistic Models with Complete Data and Density-Based Unsupervised Learning. Furthermore, as a form of episodic memory, Bayesian coresets might have more general implications for online learning, Continual Learning, Bounded Rationality and Artificial General Intelligence.

## Content

### Definition

#### Geometric Coresets

Coresets were first developed in computational geometry as a means for approximating the shape of a large point set – see Agarwal et al. [1] for a comprehensive review. If an optimization problem admits an ${\displaystyle \varepsilon }$-coreset, then there is a linear or near-linear algorithm for selecting a constant size subset of the original problem, such that the exact solution to the small problem is an approximation to the exact solution of the original problem within a factor of ${\displaystyle (1+\varepsilon )}$.

One of the original applications of geometric coresets is clustering, via algorithms like ${\displaystyle k}$-center, ${\displaystyle k}$-line-center, ${\displaystyle k}$-median and ${\displaystyle k}$-means. As the latter two are instances of summed clustering, where the objective is a sum over points, their coreset formulation operates on weighted point sets [1]. In particular, let ${\displaystyle P\subset \mathbb {R} ^{D}}$ be a point set, ${\displaystyle d(p,C)=\min _{q\in C}\,d(p,q)}$ a distance function, ${\displaystyle w:\,P\to \mathbb {N} }$ a weight function, and ${\displaystyle \mu _{P}^{w}(C)=\sum _{p\,\in \,P}w(p)\,d(p,C)}$ the objective function to be minimized over candidate cluster sets ${\displaystyle C\subset \mathbb {R} ^{D}}$ of fixed size. Such problems admit ${\displaystyle \varepsilon }$-coresets, i.e., subsets ${\displaystyle {\bar {P}}\subseteq P}$ and weight functions ${\displaystyle {\bar {w}}:\,{\bar {P}}\to \mathbb {N} }$ such that for any ${\displaystyle C}$ of appropriate size,

${\displaystyle (1-\varepsilon )\,\mu _{P}^{w}(C)\;\leq \;\mu _{\bar {P}}^{\bar {w}}(C)\;\leq \;(1+\varepsilon )\,\mu _{P}^{w}(C)\;.}$
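As a toy illustration of this definition, the sketch below evaluates the weighted summed-clustering objective on a full point set and on a weighted subset. All names, and the uniform-subsampling construction itself, are assumptions for illustration; this is not one of the cited coreset algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu(P, w, C):
    """Weighted summed-clustering objective mu_P^w(C) = sum_p w(p) d(p, C)."""
    # distance of every point to its nearest candidate center
    d = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=-1).min(axis=1)
    return float(w @ d)

P = rng.normal(size=(10_000, 2))          # full point set (unit weights)
C = np.array([[0.0, 0.0], [3.0, 3.0]])    # one candidate cluster set

full = mu(P, np.ones(len(P)), C)

# Naive weighted subset: m points sampled uniformly, each weighted n/m, so
# that the subset objective is an unbiased estimate of the full one. A true
# eps-coreset construction must guarantee the (1 +/- eps) bound for ALL C.
m = 500
idx = rng.choice(len(P), size=m, replace=False)
approx = mu(P[idx], np.full(m, len(P) / m), C)

rel_err = abs(approx - full) / full
```

Note the crucial difference from plain Monte Carlo estimation: an ${\displaystyle \varepsilon }$-coreset must satisfy the bound uniformly over all candidate cluster sets ${\displaystyle C}$, not merely in expectation for a single fixed one.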

If the points ${\displaystyle p\in P}$, the cluster set ${\displaystyle C}$ and the distance function ${\displaystyle d(p,C)}$ are replaced respectively by observations ${\displaystyle x\in X}$, parameters ${\displaystyle \theta }$ and a log-likelihood function ${\displaystyle \log \left\{p(\{x\};\theta )\right\}}$, the structural similarity of the clustering problem to maximum likelihood estimation (MLE) becomes apparent. This naturally raises the question of whether the notion of approximation via weighted coresets can be extended to the setting of statistical inference.

#### Bayesian Coresets

Let a Bayesian model for observations ${\displaystyle X=\{x_{1},\dots ,x_{n}\}}$ consist of a prior measure ${\displaystyle \Pi _{\emptyset }(\operatorname {d} \!\theta )}$ and a likelihood ${\displaystyle p(X;\theta )}$ with parameters ${\displaystyle \theta \in \mathbb {R} ^{D}}$. Define the log-likelihood or potential ${\displaystyle {\mathcal {L}}_{X}(\theta )=\log \left\{p(X;\theta )\right\}}$ and the marginal likelihood, evidence, or partition function ${\displaystyle Z_{X}=\int \exp \left\{{\mathcal {L}}_{X}(\theta )\right\}\Pi _{\emptyset }(\operatorname {d} \!\theta )}$. Then the posterior measure over parameters ${\displaystyle \theta }$ is

${\displaystyle \Pi _{X}(\operatorname {d} \!\theta )={\frac {1}{Z_{X}}}\,\exp \left\{{\mathcal {L}}_{X}(\theta )\right\}\,\Pi _{\emptyset }(\operatorname {d} \!\theta )\;.}$

Under the conditional independence assumption, ${\displaystyle \forall \;i,j\in [n],\;i\neq j:\;x_{i}\perp \!\!\!\perp _{\theta }x_{j}}$, the potential decomposes as

${\displaystyle {\mathcal {L}}_{X}(\theta )=\log \,\left\{\prod _{x\,\in \,X}p\left(\{x\};\theta \right)\right\}=\sum _{x\,\in \,X}{\mathcal {L}}_{\{x\}}(\theta )\;.}$

In order to reduce the cost of evaluating ${\displaystyle {\mathcal {L}}_{X}(\theta )}$ in inference methods, a Bayesian coreset [2] approximation is introduced as

${\displaystyle {\mathcal {L}}_{\bar {X}}^{w}=\sum _{x\,\in \,{\bar {X}}}w(x)\,{\mathcal {L}}_{\{x\}}\;,}$

consisting of subsampled data ${\displaystyle {\bar {X}}\subseteq X}$ and a corresponding weight function ${\displaystyle w:\,{\bar {X}}\to \mathbb {R} ^{+}}$. The coreset posterior approximation induced by ${\displaystyle {\mathcal {L}}_{\bar {X}}^{w}}$ is then

${\displaystyle \Pi _{\bar {X}}^{w}(\operatorname {d} \!\theta )={\frac {1}{Z_{\bar {X}}^{w}}}\,\exp\{{\mathcal {L}}_{\bar {X}}^{w}(\theta )\}\,\Pi _{\emptyset }(\operatorname {d} \!\theta )\;,\quad Z_{\bar {X}}^{w}=\int \exp\{{\mathcal {L}}_{\bar {X}}^{w}(\theta )\}\,\Pi _{\emptyset }(\operatorname {d} \!\theta )\;.}$
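For intuition, the following sketch compares the full posterior to a coreset posterior in a conjugate Gaussian location model, chosen purely for illustration because the weighted posterior is then available in closed form; the naive uniform subsample with weights ${\displaystyle n/m}$ is likewise only a stand-in for the constructions discussed below.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (assumed for illustration): x_i ~ N(theta, 1), theta ~ N(0, 1).
# A weighted Gaussian log-likelihood stays conjugate, so the coreset
# posterior Pi^w_Xbar is Gaussian with precision 1 + sum_i w_i and
# mean (sum_i w_i x_i) / (1 + sum_i w_i).
def posterior(x, w):
    prec = 1.0 + w.sum()
    mean = (w @ x) / prec
    return mean, 1.0 / prec

n = 100_000
x = rng.normal(loc=2.0, scale=1.0, size=n)

full_mean, full_var = posterior(x, np.ones(n))

# Coreset: uniform subsample of size m with weights n/m (a naive choice;
# the cited algorithms pick the subset and weights far more carefully).
m = 1_000
idx = rng.choice(n, size=m, replace=False)
coreset_mean, coreset_var = posterior(x[idx], np.full(m, n / m))
```

The point of the construction is that evaluating ${\displaystyle {\mathcal {L}}_{\bar {X}}^{w}}$ inside an inference algorithm now costs ${\displaystyle O(|{\bar {X}}|)}$ per evaluation instead of ${\displaystyle O(n)}$.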

### Construction Algorithms

There are several variations on this theme, including approaches such as likelihood-based data squashing (LDS) by Madigan et al. [3], which constructs pseudo data points for summarization. The following discussion is restricted to methods which only select and weight observed data points.

#### Sampling Approaches

In the realm of randomized approximation algorithms, the first improvement over trivial uniform sampling is importance sampling, in which subsampling is performed according to a distribution that reflects each data point's influence on the overall likelihood (its sensitivity). Variations of this approach are taken by most of the existing literature, both for geometric and for Bayesian coresets, including Feldman et al. [4][5][6], Zhang et al. [7], Huggins et al. [2], Campbell et al. [8] and Lucic et al. [9][10].
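A minimal sketch of this idea follows, under assumptions made purely for illustration: a Gaussian model, a fixed grid of parameter samples standing in for a cheap posterior approximation, and per-datum potential norms as the sensitivity proxy.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup (assumed): Gaussian observations and a grid of parameter
# samples at which the per-datum potentials are tabulated.
n, m = 50_000, 2_000
x = rng.normal(loc=1.0, size=n)
thetas = np.linspace(0.0, 2.0, 25)

# L[i, s] = log N(x_i; theta_s, 1), up to an additive constant
L = -0.5 * (x[:, None] - thetas[None, :]) ** 2

full = L.sum(axis=0)  # full potential evaluated at each theta_s

# Importance sampling: probabilities proportional to per-datum potential
# norms (a simple sensitivity proxy), weights 1/(m p_i) for unbiasedness.
norms = np.linalg.norm(L, axis=1)
p = norms / norms.sum()
idx = rng.choice(n, size=m, p=p)       # sampled with replacement
w = 1.0 / (m * p[idx])
coreset = w @ L[idx]                   # weighted coreset potential

rel_err = np.linalg.norm(coreset - full) / np.linalg.norm(full)
```

Because the weights are ${\displaystyle 1/(m\,p_{i})}$, the weighted sum is an unbiased estimate of the full potential, and preferentially sampling high-influence points reduces its variance relative to uniform subsampling.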

#### Optimization Approaches: Hilbert Coresets

In order to improve the theoretical guarantees of Bayesian coresets over randomized constructions, Campbell et al. [8] introduced a new family of generic, deterministic algorithms. When the function space of ${\displaystyle {\mathcal {L}}}$ is a Hilbert space, i.e., when it admits an inner product, Hilbert coresets can be computed by solving suitable optimization problems formulated only in terms of inner products and Hilbert norms of the (function) vectors ${\displaystyle {\mathcal {L}}_{X}}$ and ${\displaystyle {\mathcal {L}}_{\{x\}}}$. Instances of this family of methods include:

• Quadratic optimization under a polytope constraint using the Frank-Wolfe algorithm [8],
• Manifold optimization on the unit hyper-sphere using iterative geodesic search (generalization of line search) [11], maximizing the directional alignment between the vectors of the full (${\displaystyle {\mathcal {L}}_{X}}$) and coreset (${\displaystyle {\mathcal {L}}_{\bar {X}}^{w}}$) potentials.

Both methods are greedy and iterative, achieving the linear run-time required by the coreset problem formulation. The main contribution of the latter algorithm is the optimal scaling of the coreset potential, which significantly reduces approximation errors especially in the empirically relevant regime of small coresets. Regarding the choice of Hilbert norms, several options have been suggested, including a weighted ${\displaystyle L^{2}}$ norm and a weighted Fisher information metric. In general, such norms require a cheap approximation to the true posterior as a weighting measure.
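The following is a finite-dimensional sketch of the Frank-Wolfe variant: per-datum potentials are represented by their values at ${\displaystyle S}$ parameter samples, so Euclidean inner products stand in for the Hilbert inner product. The model, sample set and iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Each row of ell is a per-datum potential vector L_{x_i} evaluated at
# S parameter samples drawn near the posterior (a toy Gaussian model).
n, S = 5_000, 50
x = rng.normal(loc=1.0, size=n)
thetas = rng.normal(loc=1.0, scale=0.2, size=S)
ell = -0.5 * (x[:, None] - thetas[None, :]) ** 2
Lfull = ell.sum(axis=0)

sigma = np.linalg.norm(ell, axis=1)   # per-datum norms sigma_i
sig = sigma.sum()

# Frank-Wolfe on J(w) = ||sum_i w_i ell_i - Lfull||^2 over the scaled
# simplex {w >= 0, sum_i sigma_i w_i = sig}, whose vertices are
# (sig / sigma_f) e_f. Each step activates at most one new data point,
# so the iteration count bounds the coreset size.
w = np.zeros(n)
for _ in range(100):
    resid = Lfull - w @ ell                           # current residual
    f = int(np.argmax(ell @ resid / sigma))           # best vertex
    d = (sig / sigma[f]) * ell[f] - w @ ell           # step direction
    gamma = np.clip((resid @ d) / (d @ d), 0.0, 1.0)  # exact line search
    w *= 1.0 - gamma
    w[f] += gamma * sig / sigma[f]

rel_err = np.linalg.norm(w @ ell - Lfull) / np.linalg.norm(Lfull)
core_size = int((w > 0).sum())
```

The scaled-simplex constraint follows the polytope formulation of [8]; note that the uniform weighting ${\displaystyle w\equiv 1}$ is feasible and recovers ${\displaystyle {\mathcal {L}}_{X}}$ exactly, so the optimization gap is purely a sparsity trade-off.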

#### Convergence Analysis

Several convergence results exist for model-specific coreset constructions, in particular for mixture models and logistic regression as in [2][4][10]. Furthermore, adaptations of classical optimization methods have been proposed for the weighted likelihood setting, e.g., weighted EM [10].

For the generic case, Huggins et al. [12] provide a non-asymptotic approach for controlling the error of finite-size Bayesian posterior approximations, by means of a two-step bound using the Fisher information and Wasserstein metrics. This enables the full characterization of the statistical error in the iterative constructions of [8] and [11], including the effect of the weighting distribution. Specifically, these approaches achieve a decay of the Wasserstein posterior distance which is exponential in ${\displaystyle |{\bar {X}}|}$, the coreset size.

#### Streaming and Parallel Computation

An important property of many coreset constructions is their compositionality, which enables their construction to be performed sequentially or in parallel on small batches of data, with a subsequent efficient merge operation preserving theoretical approximation guarantees [1]. In particular, the following properties hold for many geometric coresets:

1. If ${\displaystyle {\bar {A}},{\bar {B}}}$ are ${\displaystyle \varepsilon }$-coresets for ${\displaystyle A,B}$, then ${\displaystyle {\bar {A}}\cup {\bar {B}}}$ is an ${\displaystyle \varepsilon }$-coreset for ${\displaystyle A\cup B}$.
2. If ${\displaystyle {\bar {A}}}$ is an ${\displaystyle \varepsilon }$-coreset for ${\displaystyle A}$, and ${\displaystyle {\bar {\bar {A}}}}$ is an ${\displaystyle {\bar {\varepsilon }}}$-coreset for ${\displaystyle {\bar {A}}}$, then ${\displaystyle {\bar {\bar {A}}}}$ is an ${\displaystyle {\bar {\bar {\varepsilon }}}}$-coreset for ${\displaystyle A}$ with ${\displaystyle 1+{\bar {\bar {\varepsilon }}}=(1+\varepsilon )(1+{\bar {\varepsilon }})}$.

In both the sequential and the parallel setting, an optimal strategy for merging ${\displaystyle b}$ batch-coresets is then an alternation between (1) and (2) in a tree of depth ${\displaystyle O(\log b)}$.
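A back-of-the-envelope sketch of why the tree strategy matters: the functions below only track the multiplicative ${\displaystyle (1+\varepsilon )}$ factors from properties (1) and (2), with the exact constants of any particular construction abstracted away.

```python
import math

def tree_error(eps, b):
    """Error factor when b batch-coresets of quality eps are merged
    pairwise (property 1) and recompressed once per tree level with
    quality eps (property 2): ceil(log2 b) + 1 factors of (1 + eps)."""
    depth = math.ceil(math.log2(b))
    return (1 + eps) ** (depth + 1) - 1

def sequential_error(eps, b):
    """Naive strategy: merge one batch at a time, recompressing after
    each merge, which stacks b factors of (1 + eps)."""
    return (1 + eps) ** b - 1
```

With ${\displaystyle \varepsilon =0.01}$ and ${\displaystyle b=1024}$ batches, the tree accumulates roughly 12% total error, while naive sequential recompression blows the error up by several orders of magnitude.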

Analogous properties hold for several Bayesian coreset constructions, e.g., in [4] and in [8] when the approximation error is measured in the weighted uniform norm ${\displaystyle \|\,l\,\|:=\sup _{\theta }\left|{\frac {l(\theta )}{{\mathcal {L}}_{X}(\theta )}}\right|}$. But more generally, the posterior approximation bounds from [12] provide a formalism better suited for characterizing the compositional guarantees of Bayesian coreset constructions.

#### Related Numerical Problems

The Hilbert coreset formulation is related to the sparse approximation problem, for which a wide range of numerical methods have been studied. See [8][11] for a discussion of the particular requirements in the coreset setting.

### Relevance for Continual Learning

A long-standing challenge for machine learning and AI systems is the problem of Continual Learning or lifelong learning, i.e., the ability of an agent to learn over long time periods by incorporating new knowledge while retaining past experiences. The main obstacle for computational models of continual learning is the phenomenon of catastrophic forgetting, whereby new information interferes destructively with previously acquired knowledge representations. In a recent comprehensive review from the perspective of Artificial Neural Networks, Parisi et al. [13] describe the influence of the Complementary Learning Systems theory [14][15] on various deep learning architectures, in particular the use of dual memory and experience replay. The essential idea, motivated by a broad range of anatomical and cognitive studies, is an intricate interplay of two learning systems with complementary roles via storage, retrieval and goal-dependent replay:

• Semantic memory: responsible for learning general regularities/statistics of the environment; slow learning rate and overlapping, structured representations; associated with the neo-cortex,
• Episodic memory: responsible for learning arbitrary specifics of the environment; rapid learning rate and sparse representations; associated with the hippocampus.

In the context of Bayesian continual learning, the semantic and episodic memories could be interpreted as roughly corresponding to a generative model and a coreset, respectively – see, e.g., the continual learning model of Nguyen et al. [16].

## Annotated Bibliography

1. Agarwal, Pankaj K., Sariel Har-Peled, and Kasturi R. Varadarajan. "Geometric approximation via coresets." Combinatorial and computational geometry 52 (2005): 1-30. http://library.msri.org/books/Book52/files/01agar.pdf
2. Huggins, Jonathan, Trevor Campbell, and Tamara Broderick. "Coresets for scalable Bayesian logistic regression." Advances in Neural Information Processing Systems. 2016. http://papers.nips.cc/paper/6485-coresets-for-scalable-bayesian-logistic-regression
3. Madigan, David, et al. "Likelihood-based data squashing: A modeling approach to instance construction." Data Mining and Knowledge Discovery 6.2 (2002): 173-190. https://link.springer.com/article/10.1023/A:1014095614948
4. Feldman, Dan, Matthew Faulkner, and Andreas Krause. "Scalable training of mixture models via coresets." Advances in neural information processing systems. 2011. http://papers.nips.cc/paper/4363-scalable-training-of-mixture-models-via-coresets
5. Feldman, Dan, Melanie Schmidt, and Christian Sohler. "Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering." Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2013. https://dl.acm.org/citation.cfm?id=2627920
6. Braverman, Vladimir, Dan Feldman, and Harry Lang. "New frameworks for offline and streaming coreset constructions." arXiv preprint arXiv:1612.00889 (2016). https://arxiv.org/abs/1612.00889
7. Zhang, Min, et al. "Computational efficient Variational Bayesian Gaussian Mixture Models via Coreset." Computer, Information and Telecommunication Systems (CITS), 2016 International Conference on. IEEE, 2016. https://ieeexplore.ieee.org/abstract/document/7546405/
8. Campbell, Trevor, and Tamara Broderick. "Automated Scalable Bayesian Inference via Hilbert Coresets." arXiv preprint arXiv:1710.05053 (2017). https://arxiv.org/abs/1710.05053
9. Lucic, Mario, Olivier Bachem, and Andreas Krause. "Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures." Artificial Intelligence and Statistics. 2016. http://proceedings.mlr.press/v51/lucic16.html
10. Lucic, Mario, et al. "Training Gaussian mixture models at scale via coresets." The Journal of Machine Learning Research 18.1 (2017): 5885-5909. http://www.jmlr.org/papers/v18/15-506.html
11. Campbell, Trevor, and Tamara Broderick. "Bayesian coreset construction via greedy iterative geodesic ascent." arXiv preprint arXiv:1802.01737 (2018). https://arxiv.org/abs/1802.01737
12. Huggins, Jonathan H., et al. "Practical bounds on the error of Bayesian posterior approximations: A nonasymptotic approach." arXiv preprint arXiv:1809.09505 (2018). https://arxiv.org/abs/1809.09505
13. Parisi, German I., Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. "Continual lifelong learning with neural networks: A review." Neural Networks (2019). https://www.sciencedirect.com/science/article/pii/S0893608019300231
14. McClelland, James L., Bruce L. McNaughton, and Randall C. O'reilly. "Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory." Psychological review 102.3 (1995): 419. https://pdfs.semanticscholar.org/57f1/16f3e6780424463cc8416ce755a72f873aa9.pdf
15. Kumaran, Dharshan, Demis Hassabis, and James L. McClelland. "What learning systems do intelligent agents need? Complementary Learning Systems theory updated." Trends in cognitive sciences 20.7 (2016): 512-534. https://www.sciencedirect.com/science/article/pii/S1364661316300432
16. Nguyen, Cuong V., et al. "Variational continual learning." arXiv preprint arXiv:1710.10628 (2017). https://arxiv.org/abs/1710.10628