# Course:CPSC522/Learning Attention via Active Inference

## Learning Attention with Active Inference

We investigate the application of the theory of active inference to the design of attention mechanisms in deep learning. In particular, we develop a tractable approximation to the information-theoretically optimal attention policy given by Bayesian experimental design. While this formulation only combines existing optimization methods in a novel way, we hypothesize that it could prove more efficient and biologically plausible than current computational models of attention.

Principal Authors: Boyan Beronov, William Harvey

Collaborators:

## Abstract

Information processing systems can significantly benefit from attention in terms of computational efficiency, by adaptively focusing computational effort on the most salient parts of their input. Attention has been widely studied in biology, in particular in visual systems, and can be found in many examples of algorithm design. Recently, many deep learning systems have incorporated this approach, in problem domains as diverse as computer vision and natural language processing. In this context, attention mechanisms are usually incorporated into neural networks as masks or filters on inputs or intermediate representations, and trained jointly with other network weights by stochastic gradient descent on some end-to-end supervision loss. Attention mechanisms of this sort lie on a spectrum between differentiable, but computationally expensive (soft attention) and fast, but non-differentiable (hard attention). In practice, the use of attention in real-world applications is limited, as the high variance in gradient estimators and the indirect training signal, which does not incorporate terms specific to attention, lead to difficult training procedures.

An alternative view of attention, which is grounded in information theory, regards it as an instance of Bayesian experimental design. Given some quantity of interest, attention would then be directed at the most informative parts of the input, i.e., those which maximize the expected information gain under the current belief. This formulation provides a more interpretable, direct and potentially faster approach for training attention mechanisms. The computational cost of Bayesian experimental design might, however, outweigh the increase in training speed.

The approach chosen in our experiment for training an attention mechanism is active inference, also known as the variational free energy principle. This process theory [1] describes non-equilibrium steady state systems in information-theoretic terms, suggesting a unified view of inference and action. It originated in cognitive neuroscience, and has since been successfully applied as a modelling tool for a variety of biological and cognitive phenomena. It describes the totality of an agent's information processing as a dynamical process performing variational Bayesian inference to optimize both its beliefs — for modelling the state and dynamics of the world and its own sensory inputs — and its action policy — for fulfilling its motivations. While active inference has been used to study biological attention mechanisms, it has so far been scarcely applied in machine learning in its general form.

Our model employs a recent computational reformulation of active inference [2][3], consisting of the following components:

• Attention saccades are actions, and they result in sensory inputs which are correspondingly masked observations of the environment. These are the only sources of information the agent has direct access to.
• These sensory inputs are used in a first variational inference process to update two belief representations:
  • the generative model, which defines a joint distribution over past and future hidden (world) states and sensory states, conditioned on past and future actions,
  • and the active posterior, which defines a distribution over future hidden and sensory states, conditioned on past experience and possible future actions.
• When supplied with the current active posterior, the agent's motivation functional defines a scalar potential for each possible action.
• This potential in turn yields a Gibbs distribution over actions. The induced policy for the next action arises from a second variational inference process which approximates this intractable distribution.
• We define label prediction for observed images as another form of action next to attention saccades, and a supervision reward is incorporated as an additional channel of sensory input.
• The agent's behavior is then determined by way of its motivation functional, which is composed in our setting of the expected cumulative discounted reward and the expected information gain with respect to label prediction.

The motivation functional defines a generic computational model for teleological (goal-directed) behavior [4] which can depend on arbitrary scalar functions of future trajectories. The exact formulation and variational approximation of complex goals, such as information gain maximization, is only possible because of the agent's explicit, probabilistic dynamical model. From a machine learning perspective, an active inference formulation enables greater flexibility and control over model training, by making it possible to specify arbitrary objectives over policies and trajectories, and allowing them to be approximated and optimized incrementally by the agent. This is in contrast to the currently common strategy for such problems, in which expensive data augmentations are computed offline and added to the cost terms of a reinforcement learning problem, which is in turn solved by batch optimization. In short, the application of the theory of active inference in machine learning uses established probabilistic models and optimization algorithms, but recombines them into a structure which is motivated by greater biological plausibility.

In the context of image classification, our experiment tests the hypothesis that an active inference formulation of visual attention is able to successfully learn an attention policy on par with the Bayes-optimal policy computed offline.

### Builds on

Statistical inference: Probability, Variational Inference, Variational Bayesian inference, Bayesian experimental design, Gibbs measure.

### Related Pages

Theories of intelligent behavior: Free energy principle, Control Theory, Reinforcement Learning, Game Theory, Bounded Rationality, Artificial General Intelligence.

Generalized filtering is a Bayesian filtering method for state space models (similar to Kalman filtering and particle filtering), and is based on Markovian assumptions about random fluctuations, variational inference and the fundamental lemma of the calculus of variations. Active inference rests on the same mathematical techniques, but lifts the framework to a general theory of non-equilibrium steady states as systems capable of autopoiesis, i.e., counteracting fluctuations of their internal structure and environment, by performing approximate inference about the world dynamics. In particular, this encompasses a theory of intelligence subscribing to the Bayesian brain hypothesis [5].

## Attention

Attention describes the act of an information processing system focusing computation on the most salient parts of an input, and allows for vast gains in efficiency by preventing the "waste" of computation on less important parts. The human brain alone performs attention at multiple different levels; the most obvious of these is the component of visual attention directed by eye movements (saccades).

### Attention in Deep Learning

Inspired by the widespread use of attention in both natural and artificial systems, much work in recent years has approached the task of integrating attention with deep learning, such that neural networks are trained to adaptively select which part of an input to attend to. Almost all modern approaches train this policy so as to optimize the end-to-end loss of the network for the given task (e.g. image classification). Attention in deep learning falls into two categories: soft attention, which is computationally expensive but simple to train; and hard attention, which is cheaper but harder to train, and so seldom used in real-world applications.

#### Soft Attention

In soft attention, an attention policy defines a set of weights over all parts of an input. The output of the attention mechanism is then a weighted average of the embeddings of all parts of the input, with the weights defined by the attention policy. The use of a simple weighted average means that it is possible to differentiate the loss with respect to the attention policy's parameters, and so to compute low-variance estimates of the gradient using back-propagation. This makes soft attention simple to train, and thus it has become widely used in fields such as Natural Language Processing[6]. However, since soft attention essentially attends to all of an input, it does not bring about the great gains in computational efficiency discussed earlier.
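
The weighted average described above can be sketched in a few lines of plain Python. This is a minimal illustration under assumed inputs, not the mechanism of any particular model: the function names `softmax` and `soft_attention`, the choice of a softmax to normalize the scores, and the example values are all introduced here for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(scores, embeddings):
    """Return the attention weights and the weighted average of the embeddings.

    scores:     one unnormalized relevance score per input part (assumed given)
    embeddings: one feature vector (list of floats) per input part
    """
    weights = softmax(scores)
    dim = len(embeddings[0])
    # Every part of the input contributes to the output, scaled by its weight.
    output = [sum(w * e[i] for w, e in zip(weights, embeddings)) for i in range(dim)]
    return weights, output

# Three input parts with 2-dimensional embeddings (illustrative values only).
weights, output = soft_attention(
    [2.0, 0.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
)
```

Every embedding enters the sum, which is why the output is differentiable in the scores but also why no computation is saved.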

#### Hard Attention

In contrast to soft attention, hard attention mechanisms make a "hard" decision about where to attend, and thereby avoid processing all of the input[7]. Hence, hard attention holds the promise of the greater computational efficiency that is sought after in computer vision. However, hard attention remains little-used in practical applications. The primary reason for this is that it is difficult to train: most mechanisms for extracting a region of an input are non-differentiable with respect to the patch parameters, and so it is not possible to train hard attention simply by back-propagation. The alternative used in most applications is to frame the learning of an attention policy as a Reinforcement Learning problem. However, this leads to high-variance gradient estimates, so training can be slow and is prone to getting stuck in local optima.
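
The variance issue can be illustrated with a toy score-function (REINFORCE) estimator. The sketch below assumes a categorical policy over three locations with a fixed reward per location; the function names, logits and rewards are hypothetical and chosen only so that the exact gradient can be computed by enumeration for comparison.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(logits, rewards):
    """Gradient of the expected reward w.r.t. the logits, by enumeration."""
    probs = softmax(logits)
    mean_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - mean_reward) for p, r in zip(probs, rewards)]

def reinforce_estimate(logits, rewards, rng):
    """Single-sample score-function estimate of the same gradient."""
    probs = softmax(logits)
    k = rng.choices(range(len(probs)), weights=probs)[0]  # the "hard" decision
    # d/d logit_j of log p(k) equals 1[j == k] - p_j for a softmax policy.
    return [rewards[k] * ((1.0 if j == k else 0.0) - probs[j])
            for j in range(len(probs))]

rng = random.Random(0)
logits, rewards = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
n = 20000
mean_estimate = [0.0] * len(logits)
for _ in range(n):
    g = reinforce_estimate(logits, rewards, rng)
    mean_estimate = [m + x / n for m, x in zip(mean_estimate, g)]
exact = exact_gradient(logits, rewards)
```

Individual estimates differ widely from sample to sample, and only their average over many samples approaches the exact gradient, which is the behavior that makes training slow in practice.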

### Attention as Bayesian Experimental Design

Figure 1. Training accuracy against number of examples trained on, for MNIST digit classification. A hard attention mechanism is used which can attend to only one pixel at a time, for a maximum of six pixels. Using experimental design to augment the loss for 10% of the training examples (red curve) can be seen to greatly improve the training speed compared to using the REINFORCE loss alone.

Bayesian experimental design provides an alternative to formulating attention as a neural network module to train according to some end-to-end loss[8]. Bayesian experimental design is an information-theoretic framework defining an optimal procedure for learning about some unknown quantity. It states that actions should be chosen to maximize the expected information gain between the prior distribution over this quantity (before attending) and the posterior distribution over this quantity (the updated belief after performing attention):

${\displaystyle \mathbf {I} (\Theta ;Y|d)=\mathbb {E} _{p(\Theta ,Y|d)}\left[{\mathcal {H}}\left[p(\theta )\right]-{\mathcal {H}}\left[p(\theta |y,d)\right]\right]}$

Here, ${\displaystyle \Theta }$ is the latent variable of interest and ${\displaystyle Y}$ is the observation, which has a distribution dependent on both ${\displaystyle \Theta }$ and the design of the experiment ${\displaystyle d}$. In the context of attention, ${\displaystyle d}$ is the part of the input which is attended to. Bayesian experimental design is optimal in the sense that it minimizes the expected entropy of the posterior over the latent variable of interest. Attending in this way can also be shown to be the optimal policy for minimizing an end-to-end loss when the loss used is the negative log-likelihood of a predicted distribution over ${\displaystyle \Theta }$.
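
For a small discrete problem the expected information gain can be evaluated exactly by enumeration. The following sketch assumes a binary latent variable, a binary observation and two hypothetical designs; the function names and the likelihood tables are illustrative and not taken from the experiment.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def expected_information_gain(prior, likelihood):
    """Prior entropy minus expected posterior entropy for one fixed design d.

    prior[i]         = p(theta_i)
    likelihood[i][j] = p(y_j | theta_i, d)
    """
    eig = entropy(prior)
    for j in range(len(likelihood[0])):
        joint = [prior[i] * likelihood[i][j] for i in range(len(prior))]
        p_y = sum(joint)  # marginal probability of observing y_j under d
        if p_y > 0.0:
            posterior = [x / p_y for x in joint]
            eig -= p_y * entropy(posterior)
    return eig

prior = [0.5, 0.5]
designs = {
    "informative": [[0.9, 0.1], [0.1, 0.9]],    # observation depends on theta
    "uninformative": [[0.5, 0.5], [0.5, 0.5]],  # observation ignores theta
}
gains = {name: expected_information_gain(prior, lik) for name, lik in designs.items()}
best_design = max(gains, key=gains.get)
```

A design whose observation is independent of the latent variable has zero expected information gain, so the design selected here is the informative one.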

In general, finding ${\displaystyle d}$ to maximize the expected information gain is intractable, and approximations are computationally expensive. This prevents explicit Bayesian experimental design from being used in deep learning systems. One approach to integrating Bayesian experimental design with deep learning whilst avoiding this issue is to use the theory of Bayesian experimental design to compute an approximately optimal sequence of locations to attend to on some training data, before training a neural network to approximate this policy. Figure 1 illustrates that this can greatly speed up the training of hard attention mechanisms.

Although the use of Bayesian experimental design means that far fewer iterations are needed for training, the high computational cost of augmenting the training data with these optimal locations can outweigh the benefits of the reduction in training iterations.

## Active Inference

Main article: Free energy principle

### Difference to Variational Bayesian Inference

The term variational carries several different meanings. In variational Bayesian inference, it refers to the variational free energy bound [9], which allows intractable integrals to be approximated by optimizing a bounding quantity. This is a different notion from the variational principle, which allows the description of functions as minima of functionals. Active inference combines both concepts, since it describes the dynamics of non-equilibrium steady state systems as satisfying the variational principle of least action, where action is defined as the path integral of variational free energy.
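
To make the first meaning concrete, the bound can be written out in its standard form. The symbols ${\displaystyle x}$ (observations), ${\displaystyle z}$ (latent variables), ${\displaystyle q}$ (variational distribution) and ${\displaystyle F}$ (variational free energy) are notation introduced here for illustration and are not used elsewhere on this page:

${\displaystyle F[q]=\mathbb {E} _{q(z)}\left[\ln q(z)-\ln p(x,z)\right]=D_{\mathrm {KL} }\left[q(z)\,\|\,p(z\mid x)\right]-\ln p(x)\;\geq \;-\ln p(x)}$

Since the Kullback-Leibler divergence is non-negative, minimizing ${\displaystyle F}$ over ${\displaystyle q}$ tightens the bound on the intractable quantity ${\displaystyle -\ln p(x)=-\ln \int p(x,z)\operatorname {d} \!z}$ while pulling ${\displaystyle q}$ towards the posterior ${\displaystyle p(z\mid x)}$.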

### Difference to Control Theory / Reinforcement Learning / Game Theory

Main article: Active inference

### Model Components

The following table summarizes the key components of active inference, in a formalism which is a slightly simplified version of the exposition in [2][3]. See the figures therein for an intuitive graphical overview of the model. For simplicity, we are omitting the distinction between hidden parameters ${\displaystyle \theta }$ and hidden variables ${\displaystyle e_{t}}$, since our experimental setup does not include any world dynamics other than the agent's attention saccades. The bold components are parametrized as neural networks in our experiment, and all other components constitute the computations required for training them.

Components of the Active Inference model in [2][3]

| Component | Formula |
| --- | --- |
| environment (hidden) states: true / modeled | ${\displaystyle e_{t}\in E\;\mid \;{\hat {e}}_{t}\in {\hat {E}}}$ |
| sensory (input) states: true / modeled | ${\displaystyle s_{t}\in S\;\mid \;{\hat {s}}_{t}\in {\hat {S}}}$ |
| actions: true / modeled | ${\displaystyle a_{t}\in A\;\mid \;{\hat {a}}_{t}\in {\hat {A}}}$ |
| true (hidden) dynamics | ${\displaystyle p_{E,S\mid E,A}\!\left(e_{t+1:T},s_{t+1:T}\mid e_{t},a_{t+1:T}\right)}$ |
| variational generative model | ${\displaystyle {\hat {p}}_{{\hat {E}},{\hat {S}}\mid {\hat {E}},{\hat {A}}}\!\left({\hat {e}}_{t+1:T},{\hat {s}}_{t+1:T}\mid {\hat {e}}_{t},{\hat {a}}_{t+1:T}\right)}$ |
| active posterior | ${\displaystyle q_{{\hat {E}},{\hat {S}}\mid S,A,{\hat {A}}}\!\left({\hat {e}}_{0:T},{\hat {s}}_{t+1:T}\mid s_{0:t},a_{0:t},{\hat {a}}_{t+1:T}\right)}$ |
| variational active posterior | ${\displaystyle {\hat {q}}_{{\hat {E}},{\hat {S}}\mid {\hat {A}}}\!\left(\cdot \mid {\hat {a}}_{t+1:T},{\underline {\Phi _{t}}}\right)\approx q_{{\hat {E}},{\hat {S}}\mid S,A,{\hat {A}}}\!\left(\cdot \mid \cdot ,{\hat {a}}_{t+1:T}\right)}$ |
| motivation functional | ${\displaystyle {\mathcal {M}}:{\mathcal {P}}\!\left({\hat {E}},{\hat {S}}\mid {\hat {A}}\right)\times {\hat {A}}\to \mathbb {R} }$ |
| induced policy | ${\displaystyle \pi _{q}^{{\mathcal {M}},\beta }({\hat {a}}_{t+1:T})={\frac {\exp \left\{\beta \cdot {\mathcal {M}}\left(q,{\hat {a}}_{t+1:T}\right)\right\}}{\int _{\hat {A}}\exp \left\{\beta \cdot {\mathcal {M}}\left(q,{\hat {a}}_{t+1:T}\right)\right\}\operatorname {d} \!{\hat {a}}_{t+1:T}}}}$ |
| induced variational policy | ${\displaystyle \pi _{{\hat {q}}_{{\hat {E}},{\hat {S}}\mid {\hat {A}}}\!\left(\cdot \mid \cdot ,{\underline {\Phi _{t}}}\right)}^{{\mathcal {M}},\beta }}$ |
| variational induced variational policy | ${\displaystyle {\hat {\pi }}\!\left({\hat {a}}_{t+1:T}\mid {\underline {\Psi _{t}}}\right)\approx \pi _{{\hat {q}}_{{\hat {E}},{\hat {S}}\mid {\hat {A}}}\!\left(\cdot \mid \cdot ,{\underline {\Phi _{t}}}\right)}^{{\mathcal {M}},\beta }\!\left({\hat {a}}_{t+1:T}\right)}$ |
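
For a finite set of candidate actions, the integral in the induced policy becomes a sum and the policy can be evaluated directly. The sketch below assumes the motivation values have already been computed for each action; the function name `gibbs_policy`, the example values and the restriction to non-negative ${\displaystyle \beta }$ are assumptions made for illustration.

```python
import math

def gibbs_policy(motivations, beta):
    """pi(a) = exp(beta * M(a)) / sum over a' of exp(beta * M(a')).

    motivations: one scalar motivation value per candidate action
    beta:        inverse temperature, assumed non-negative here
    """
    m = max(motivations)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (v - m)) for v in motivations]
    z = sum(weights)
    return [w / z for w in weights]

motivations = [-0.7, -0.3, -0.5]  # hypothetical values of M(q, a) for three actions
uniform = gibbs_policy(motivations, beta=0.0)   # ignores the motivation entirely
greedy = gibbs_policy(motivations, beta=50.0)   # concentrates on the best action
```

The parameter ${\displaystyle \beta }$ therefore interpolates between uniformly random behavior and deterministic maximization of the motivation functional.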

## Experiment

### Definition of Environment

We experimented by training an agent to perform inference about a 5x5 grid world, while attending to only one pixel at a time. The world consists of a stochastically sampled number of horizontal, diagonal, and vertical lines. These can have varying brightness, and be located anywhere on the image. Samples of this world are shown in Figure 2. Using the principles of active inference, the agent learned to attend to a single pixel at a time, and to use the resulting sensory information to infer the number of horizontal lines in the image.

Figure 2. Samples of the world state, consisting of a number of horizontal, vertical and diagonal lines. The agent attempts to infer the number of horizontal lines by observing one pixel at a time.
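
A world of this kind could be sampled roughly as follows. This is only a sketch: the number of lines per type, the brightness range, the placement of diagonals and the Gaussian observation noise are all assumptions chosen for illustration, and are not the priors used in the experiment.

```python
import random

SIZE = 5  # side length of the grid world

def sample_world(rng, max_lines_per_type=2):
    """Sample a SIZE x SIZE image of horizontal, vertical and diagonal lines.

    All distributions used here are illustrative assumptions.
    Returns the image and the number of horizontal lines that were drawn.
    """
    image = [[0.0] * SIZE for _ in range(SIZE)]

    def draw(cells):
        brightness = rng.uniform(0.5, 1.0)
        for r, c in cells:
            image[r][c] = max(image[r][c], brightness)  # overlaps keep the brighter line

    n_horizontal = rng.randint(0, max_lines_per_type)
    for _ in range(n_horizontal):
        row = rng.randrange(SIZE)
        draw([(row, c) for c in range(SIZE)])
    for _ in range(rng.randint(0, max_lines_per_type)):
        col = rng.randrange(SIZE)
        draw([(r, col) for r in range(SIZE)])
    for _ in range(rng.randint(0, max_lines_per_type)):
        offset = rng.randint(-(SIZE - 2), SIZE - 2)
        draw([(r, r + offset) for r in range(SIZE) if 0 <= r + offset < SIZE])
    return image, n_horizontal

def observe(image, pixel, rng, noise_std=0.1):
    """Masked, noisy observation: the agent sees only the attended pixel."""
    r, c = pixel
    return image[r][c] + rng.gauss(0.0, noise_std)

rng = random.Random(0)
image, n_horizontal = sample_world(rng)
reading = observe(image, (2, 2), rng)
```

In this sketch two horizontal lines may fall on the same row, so the latent count can exceed the number of visibly distinct rows, which is one reason a single glance is rarely conclusive.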

### Inference on Environment

We implemented our experiment in the probabilistic programming library Pyro, in terms of (1) an environment model and (2) an active posterior program, parameterized by neural networks and representing the agent's belief. This allowed SVI [14] to be used to infer the belief over the environment given observations of some of the pixels. SVI was run for 300 steps after the observations were made, as a trade-off between getting close to convergence and running time. For our experiments, we sampled a world from the prior and rendered it. The agent was then allowed to observe up to 10 pixels in this world. Figure 3 shows a visualization of this inference process. It can be seen that the agent's belief is close to the true world, particularly in the vicinity of the locations it has observed.

Figure 3. A visualisation of the agent's inference over its belief state. Left: The true world. Middle: The pixels which the agent observed (shown in white). Right: The expectation of the world, according to the agent's belief (i.e. the variational distribution).

### Inference on Policy

The agent used the negative expected posterior entropy after the next single action and observation as the motivation functional. To infer the Gibbs distribution induced by this, the agent enumerated over all possible actions (to consider attending to each possible pixel). For each one, several brightness values were sampled from the previous belief state (5 were sampled in the figure shown). For each brightness sampled, SVI was run to infer a posterior (as if the agent had observed this brightness at this pixel). The entropy of the posterior over the number of horizontal lines could then be calculated directly because of the way the probabilistic guide program was written: the number of horizontal lines was the first random variable sampled, and so its marginal distribution was explicit in the program.
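
The procedure above can be mimicked on a toy discrete model. In this sketch the SVI run is replaced by an exact Bayes update, which is only possible because the toy latent variable and observations are discrete; the likelihood tables, function names and the value of `beta` are hypothetical and serve only to show the structure of the computation.

```python
import math
import random

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

def posterior(belief, likelihood_a, y):
    """Exact Bayes update; stands in for the SVI run described in the text."""
    joint = [b * lik[y] for b, lik in zip(belief, likelihood_a)]
    z = sum(joint)
    return [x / z for x in joint]

def expected_posterior_entropy(belief, likelihood_a, n_samples, rng):
    """Monte Carlo estimate of the expected posterior entropy for one action."""
    total = 0.0
    for _ in range(n_samples):
        # Sample a hypothetical observation from the current belief.
        theta = rng.choices(range(len(belief)), weights=belief)[0]
        y = rng.choices(range(len(likelihood_a[theta])), weights=likelihood_a[theta])[0]
        total += entropy(posterior(belief, likelihood_a, y))
    return total / n_samples

def attention_policy(belief, likelihoods, beta=1.0, n_samples=5, rng=None):
    """Gibbs distribution over actions from the negative expected entropy."""
    rng = rng or random.Random(0)
    motivation = [-expected_posterior_entropy(belief, lik, n_samples, rng)
                  for lik in likelihoods]
    m = max(motivation)
    weights = [math.exp(beta * (v - m)) for v in motivation]
    z = sum(weights)
    return [w / z for w in weights]

belief = [0.5, 0.5]
likelihoods = [
    [[0.9, 0.1], [0.1, 0.9]],  # action 0: observation depends on the latent
    [[0.5, 0.5], [0.5, 0.5]],  # action 1: observation is uninformative
]
policy = attention_policy(belief, likelihoods)
```

With only a few samples per action the entropy estimate is generally noisy, which matches the trade-off between sample count and cost faced in the experiment.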

Figure 4. Posterior entropy given by different pixels.

The approximated policy over actions (given by a softmax of the mutual information, or equivalently by a softmax of the negative expected posterior entropy) is shown in Figure 4. Brighter pixels represent more likely actions. The image is slightly noisy, which may be due to the variational inference not fully converging, or to the relatively low number of samples of the environment used to compute the expectation. However, the pixels to which the agent assigns the highest probability mass appear to be reasonably sensible choices. Note that the agent will sometimes place high probability on looking at a pixel it has already looked at, since looking multiple times can be informative due to the observation noise.
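
The equivalence of the two softmax forms follows from the definition of the expected information gain given earlier, because the prior entropy ${\displaystyle {\mathcal {H}}\left[p(\theta )\right]}$ does not depend on the attended pixel ${\displaystyle d}$. Writing the policy over the finite set of pixels as a sum (with ${\displaystyle \beta }$ the inverse temperature of the induced policy, and ${\displaystyle d'}$ an index introduced here for the normalization), the common factor ${\displaystyle \exp \left\{\beta \,{\mathcal {H}}\left[p(\theta )\right]\right\}}$ cancels:

${\displaystyle {\frac {\exp \left\{\beta \,\mathbf {I} (\Theta ;Y|d)\right\}}{\sum _{d'}\exp \left\{\beta \,\mathbf {I} (\Theta ;Y|d')\right\}}}={\frac {\exp \left\{-\beta \,\mathbb {E} _{p(Y|d)}\left[{\mathcal {H}}\left[p(\theta |y,d)\right]\right]\right\}}{\sum _{d'}\exp \left\{-\beta \,\mathbb {E} _{p(Y|d')}\left[{\mathcal {H}}\left[p(\theta |y,d')\right]\right]\right\}}}}$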

### Future Work

The experiments described above indicate that the computational formalization of the theory of active inference behaves sensibly. In order to scale these methods to more realistic scenarios, future work will incorporate more efficient Bayesian inference methods for interfacing between the various belief components, in particular inference amortization via neural networks and Sequential Monte Carlo samplers.

## Annotated Bibliography

1. Friston, Karl, et al. "Active inference: a process theory." Neural computation 29.1 (2017): 1-49. https://www.fil.ion.ucl.ac.uk/~karl/Active%20Inference%20A%20Process%20Theory.pdf
2. Biehl, Martin, et al. "Expanding the Active Inference Landscape: More Intrinsic Motivations in the Perception-Action Loop." Frontiers in neurorobotics 12 (2018). https://www.frontiersin.org/articles/10.3389/fnbot.2018.00045/full
3. Biehl, Martin. "Geometry of Friston’s active inference." 1st Symposium on Advances in Approximate Bayesian Inference (2018). http://approximateinference.org/2018/accepted/Biehl2018.pdf
4. Rosenblueth, Arturo, Norbert Wiener, and Julian Bigelow. "Behavior, purpose and teleology." Philosophy of science 10.1 (1943): 18-24. http://turing.iimas.unam.mx/CA/sites/default/files/RosenbluethEtAl1943_0.pdf
5. Friston, Karl, James Kilner, and Lee Harrison. "A free energy principle for the brain." Journal of Physiology-Paris 100.1-3 (2006): 70-87. http://www.fil.ion.ucl.ac.uk/~karl/A%20free%20energy%20principle%20for%20the%20brain.pdf
6. Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
7. Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International conference on machine learning. 2015. http://www.jmlr.org/proceedings/papers/v37/xuc15.pdf
8. Feldman, Harriet, and Karl Friston. "Attention, uncertainty, and free-energy." Frontiers in human neuroscience 4 (2010): 215. https://www.fil.ion.ucl.ac.uk/~karl/Attention%20uncertainty%20and%20free-energy.pdf
9. Feynman, Richard P. "Statistical Mechanics." Reading, MA: Benjamin (1972).
10. Friston, Karl J., et al. "Action and behavior: a free-energy formulation." Biological cybernetics 102.3 (2010): 227-260. https://link.springer.com/content/pdf/10.1007/s00422-010-0364-z.pdf
11. Friston, Karl. "What is optimal about motor control?." Neuron 72.3 (2011): 488-498. https://www.fil.ion.ucl.ac.uk/~karl/What%20Is%20Optimal%20about%20Motor%20Control.pdf
12. Friston, Karl, Spyridon Samothrakis, and Read Montague. "Active inference and agency: optimal control without cost functions." Biological cybernetics 106.8-9 (2012): 523-541. http://www.fil.ion.ucl.ac.uk/~karl/Active%20inference%20and%20agency%20optimal%20control%20without%20cost%20functions.pdf
13. Friston, Karl, and Ping Ao. "Free energy, value, and attractors." Computational and mathematical methods in medicine 2012 (2012). http://www.fil.ion.ucl.ac.uk/~karl/Free%20Energy%20Value%20and%20Attractors.pdf
14. Hoffman, Matthew D., et al. "Stochastic variational inference." The Journal of Machine Learning Research 14.1 (2013): 1303-1347.