Course:CPSC532:StaRAI2020:ExactInference

Exact Inference in Graphical Models

Exact inference methods are used to infer the posterior probability of a query given some evidence.

Authors: Obada Alhumsi

Abstract

Exact inference methods take advantage of the factorization of graphical models, factor operations, and graph properties in order to infer the posterior probability of a query given some evidence in a time- and memory-efficient manner. Two prominent methods are variable elimination and recursive conditioning.

Related Pages

Exact Inference is often used to make predictions in Graphical Models such as Bayesian networks and Markov random fields. This page will use conditional probability tables to simplify explanations of the material.

Content

Exact Inference Methods

Inference approaches in graphical models can be split into four main categories: exact inference, variational inference, stochastic simulation, and bounded approaches. The focus of this page is exact inference, where two prominent methods, variable elimination and recursive conditioning, will be discussed in depth. All examples and explanations will be based on Bayesian networks; however, these methods are equally applicable to Markov random fields and even constraint satisfaction problems[1][2]. This means that the following exact inference methods can be applied to directed, undirected, cyclic, and acyclic graphs.

Factors

Recall that in essence, a Bayesian network represents a factorization of a joint probability distribution[3]. A factor is defined as a function applied to a tuple of variables to produce a number[3]. A factor on variables $X_1, \ldots, X_k$ can be written as $f(X_1, \ldots, X_k)$; note that a factor can be applied to more than one variable. In exact inference methods for graphical models, we think of our graphical model as a product of factors[4]:

$$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{c} \phi_c(\mathbf{X}_c),$$

where each factor $\phi_c$ is defined over a subset of the variables $\mathbf{X}_c$ and $Z$ is a normalization constant.
In Bayesian networks, these factors correspond to conditional probability distributions, while in Markov random fields, the factors encode an unnormalized distribution[4]. A Bayesian network can then be written as the following:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i)).$$
Given the former Bayesian network, we could query the marginal probability of a variable $X_j$. This marginal probability can then be calculated using the following formula, which leverages the factorization of the probability distribution[4]:

$$P(X_j) = \sum_{X_1} \cdots \sum_{X_{j-1}} \sum_{X_{j+1}} \cdots \sum_{X_n} \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i)).$$
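
As a concrete illustration (the network structure used here is an assumption for illustration, not necessarily the network in the figure below): for a three-variable network in which Fire is a parent of both Smoke and Alarm, the joint distribution factorizes as

$$P(\mathit{Fire}, \mathit{Smoke}, \mathit{Alarm}) = P(\mathit{Fire})\, P(\mathit{Smoke} \mid \mathit{Fire})\, P(\mathit{Alarm} \mid \mathit{Fire}),$$

and the marginal of Smoke is obtained by summing this product over the remaining variables:

$$P(\mathit{Smoke}) = \sum_{\mathit{Fire}} \sum_{\mathit{Alarm}} P(\mathit{Fire})\, P(\mathit{Smoke} \mid \mathit{Fire})\, P(\mathit{Alarm} \mid \mathit{Fire}).$$
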
Factor Operations

Given the equation above, we require a method to solve it in a time- and memory-efficient way (note that the variables have discrete value assignments, i.e., each variable $X_i$ takes values in a finite domain $\mathrm{dom}(X_i)$). The next sections will discuss such methods; however, we first need to cover three main factor operations used in these methods: variable assignment, factor multiplication, and factor marginalization. To explain these operations more effectively, we will use conditional probability tables, as seen in the figure below.

A simple Bayesian network with conditional probability tables
Assigning Variables

Let's first look at a factor on the discrete variables smoke, fire, and alarm.

Smoke Fire Alarm Val
t t t 0.8
t f t 0.2
f f t 0.1
f f f 0.9

We can now assign a value to a variable and find the factor values under that assignment. In this instance, let's assign smoke to be true: we no longer have the rows where smoke is false in our table, and we also lose the smoke column, since we know its assignment. A minimal code sketch of this operation is given after the table below.

Fire Alarm Val
t t 0.8
f t 0.2
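
To make the operation concrete, here is a minimal Python sketch (the representation and function names are illustrative assumptions for this page, not code from the course): a factor is stored as a tuple of variable names together with a table mapping value tuples to numbers, and assigning a variable keeps only the matching rows and drops that column.

# A factor is a pair (variables, table): `variables` is a tuple of names and
# `table` maps a tuple of values (one per variable, in order) to a number.
factor_smoke_fire_alarm = (
    ("Smoke", "Fire", "Alarm"),
    {
        (True, True, True): 0.8,
        (True, False, True): 0.2,
        (False, False, True): 0.1,
        (False, False, False): 0.9,
    },
)

def assign(factor, var, value):
    """Restrict `factor` by setting `var` to `value` and dropping that column."""
    variables, table = factor
    i = variables.index(var)
    new_vars = variables[:i] + variables[i + 1:]
    new_table = {row[:i] + row[i + 1:]: val
                 for row, val in table.items() if row[i] == value}
    return (new_vars, new_table)

# Assigning Smoke = true leaves a factor on (Fire, Alarm), matching the reduced
# table above: {(True, True): 0.8, (False, True): 0.2}
print(assign(factor_smoke_fire_alarm, "Smoke", True))
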
Factor Multiplication

In factor multiplication, two or more factors can be multiplied together if they share a common variable; this produces a new factor. Mathematically, if $f_1(X, Y)$ and $f_2(Y, Z)$ share the variable(s) $Y$, their product is the factor

$$(f_1 \times f_2)(X, Y, Z) = f_1(X, Y) \cdot f_2(Y, Z).$$

A code sketch of this operation is given after the table below.

Factor multiplication simple example:

f1(Fire, Alarm):       f2(Smoke, Alarm):       f1 x f2 on (Fire, Smoke, Alarm):
Fire Alarm Val         Smoke Alarm Val         Fire Smoke Alarm Val
t    t     0.9         t     t     0.9         t    t     t     0.9*0.9 = 0.81
t    f     0.1         f     f     0.1         t    f     f     0.1*0.1 = 0.01
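
Continuing the illustrative dictionary-of-tuples representation introduced above (again an assumption for this page, not the course's code), factor multiplication builds a factor over the union of the two variable sets and multiplies the entries that agree on the shared variables.

def multiply(f1, f2):
    """Multiply two factors; result rows agree with both inputs on shared variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    shared = [v for v in vars1 if v in vars2]
    new_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    new_table = {}
    for r1, v1 in t1.items():
        for r2, v2 in t2.items():
            a1, a2 = dict(zip(vars1, r1)), dict(zip(vars2, r2))
            if all(a1[v] == a2[v] for v in shared):
                merged = {**a1, **a2}
                new_table[tuple(merged[v] for v in new_vars)] = v1 * v2
    return (new_vars, new_table)

# f1 on (Fire, Alarm) and f2 on (Smoke, Alarm) share Alarm; their product is a
# factor on (Fire, Alarm, Smoke), e.g. 0.9 * 0.9 = 0.81 for Fire=t, Alarm=t, Smoke=t.
f1 = (("Fire", "Alarm"), {(True, True): 0.9, (True, False): 0.1})
f2 = (("Smoke", "Alarm"), {(True, True): 0.9, (False, False): 0.1})
print(multiply(f1, f2))
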
Factor Marginalization

In factor marginalization, we sum out a variable from a factor to produce a new factor that does not contain that variable. A simple example can be seen in the table below, where we sum out the alarm variable: for entries where the other two variables take the same values but alarm differs, we add the two values and delete the alarm column. A code sketch of this operation follows the table.

Factor marginalization simple example:

f(Smoke, Fire, Alarm):        summing out Alarm gives f'(Smoke, Fire):
Smoke Fire Alarm Val          Smoke Fire Val
t     t    t     0.9          t     t    0.9 + 0.05 = 0.95
t     f    t     0.9          t     f    0.9 + 0.1  = 1.0
t     f    f     0.1
t     t    f     0.05
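
In the same illustrative representation, summing out a variable groups the rows that differ only in that variable and adds their values (a sketch under the same assumptions as above).

def sum_out(factor, var):
    """Sum out `var`, producing a factor that no longer mentions it."""
    variables, table = factor
    i = variables.index(var)
    new_vars = variables[:i] + variables[i + 1:]
    new_table = {}
    for row, val in table.items():
        key = row[:i] + row[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + val
    return (new_vars, new_table)

# Summing Alarm out of the factor in the table above gives
# Smoke=t, Fire=t -> 0.9 + 0.05 = 0.95 and Smoke=t, Fire=f -> 0.9 + 0.1 = 1.0.
f = (("Smoke", "Fire", "Alarm"),
     {(True, True, True): 0.9, (True, False, True): 0.9,
      (True, False, False): 0.1, (True, True, False): 0.05})
print(sum_out(f, "Alarm"))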

Variable Elimination

Suppose we have a Bayesian network with variables $X_1, \ldots, X_n$; we would now like to calculate the posterior probability of a query variable (or variables) $Q$ given some evidence $E = e$ on some of the variables (note that $Q$ and $E$ are variables of the Bayesian network)[3]. We are thus trying to compute $P(Q \mid E = e)$. Let's define the variables that are neither query nor evidence as $Z_1, \ldots, Z_k$, where these elements are ordered according to an elimination ordering (finding an optimal elimination ordering is an NP-hard problem that won't be discussed on this page[4]). We can then define:

$$P(Q \mid E = e) = \frac{P(Q, E = e)}{P(E = e)},$$

where:

$$P(Q, E = e) = \sum_{Z_k} \cdots \sum_{Z_1} \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))\Big|_{E = e} \qquad \text{and} \qquad P(E = e) = \sum_{Q} P(Q, E = e).$$

Algorithm

It is now possible to define the variable elimination algorithm in a concise manner. In layman's terms, we loop over the variables according to the elimination ordering, eliminating them as we go. Formally, we can represent the algorithm as follows[3]:

  1. Create a factor for each conditional probability distribution in our network
  2. Set the value of evidence variables to their observed values
  3. Marginalize (sum out) the non-observed variables according to our elimination ordering (we sum out a variable by partitioning the factors into those that contain the variable and those that don't; the factors that contain the variable are multiplied together, and the variable is then summed out of this new factor)
  4. Multiply the remaining factors and normalize

Two examples of this algorithm in action can be seen in the video below.
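
In addition, the following is a minimal Python sketch of the four steps above. It reuses the illustrative `assign`, `multiply`, and `sum_out` helpers defined in the factor-operation examples earlier on this page; the function signature and the restriction to a single query variable are assumptions for this sketch, not a reference implementation.

def variable_elimination(factors, query, evidence, ordering):
    """Posterior P(query | evidence) over discrete variables.

    factors:  list of (variables, table) pairs, one per conditional probability table.
    query:    name of the single query variable.
    evidence: dict mapping observed variables to their observed values.
    ordering: elimination ordering over the non-query, non-evidence variables.
    """
    # Step 1 is implicit: `factors` already holds one factor per CPT.
    # Step 2: set evidence variables to their observed values.
    for var, value in evidence.items():
        factors = [assign(f, var, value) if var in f[0] else f for f in factors]

    # Step 3: sum out the remaining variables in the given ordering.
    for var in ordering:
        relevant = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        if not relevant:
            continue
        prod = relevant[0]
        for f in relevant[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, var)]

    # Step 4: multiply the remaining factors (all on the query) and normalize.
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    _, table = result
    total = sum(table.values())
    return {row[0]: val / total for row, val in table.items()}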

Time Complexity

The running time of variable elimination is heavily dependent on the size and structure of our Bayesian network. In variable elimination, we are required to multiply factors to form a bigger factor over many variables. This relationship is exponential: if we multiply two factors to form a factor over $k$ variables, it requires $O(d^k)$ time to compute, where $d$ is the number of values each variable can take[4]. This clearly shows the importance of the elimination ordering, as it changes the number of variables in the factors we multiply. However, finding an optimal ordering is an NP-hard problem, so we settle for heuristic methods, such as min-neighbors, where we order variables by their number of dependent neighbours and eliminate the variable with the fewest first.

In total, the running time of variable elimination is $O(n \, d^{M+1})$, where $n$ is the number of variables, $M$ is the maximum size (number of variables) of any factor formed during the elimination process, and $d$ is the number of possible values each variable can take[4].
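
For a rough, purely illustrative sense of scale (the numbers are not from the original page): with $n = 30$ boolean variables ($d = 2$) and an ordering whose largest intermediate factor contains $M = 5$ variables, the bound is about $30 \cdot 2^{6} = 1920$ table entries to process, whereas a poor ordering that creates a factor over 20 variables costs on the order of $30 \cdot 2^{21} \approx 6.3 \times 10^{7}$.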

Applications & Limitations

The main application of variable elimination in graphical models is to find the posterior probability distribution of some query variables given some evidence. There are several limitations to the basic form of the algorithm. Firstly, as we do not know the optimal way to reduce the number of variables in our factors during the elimination process, we cannot guarantee optimal performance. Secondly, the algorithm as presented cannot deal with continuous variables, as can be seen from the $d$ term in the running-time bound. Lastly, networks with a large number of densely interconnected variables produce large intermediate factors, making the algorithm exponentially slower.

It's important to note that both variable elimination and recursive conditioning are methods that reduce the algorithmic complexity of exact inference in graphical models; thus, their real-world applications are the applications of graphical models themselves. Nonetheless, as variable elimination is an exponential-time algorithm, its use is restricted to graphical models of modest size (in terms of treewidth and memory requirements). Examples include computing a medical prognosis given evidence from a few tests, or deciding whether to turn on a sprinkler given the weather forecast and other variables.

Recursive Conditioning

Recursive conditioning can be thought of as the search variant of variable elimination: both algorithms use the same factor operations, but recursive conditioning takes a "divide and conquer" approach[5]. If we view the model as a factor graph, with variables as nodes connected through factors, then whenever an assignment disconnects the graph into separate components, these components can be evaluated separately, as shown in the figure below.

This image shows the approach of recursive conditioning in a simplified manner

Let's assume that our goal is to compute the probability of node E. If this computation is made under an assumed value of B, we can reduce the network to a singly connected network, as seen in the middle layer of the figure above[6]. Furthermore, cutting that arc also reduces the size of the associated conditional probability table. The reduced network obtained from such an assumption is equivalent to the previous network for calculating the probability of node E[6]. This process is called recursive decomposition, and it reduces the computation of probabilities to single-node networks.

Algorithm

The algorithm takes evidence (a context) and a set of factors (conditional probability distributions) as input, and outputs the value of summing out the other variables in the network[5]. In layman's terms, the algorithm conditions on variables whose assignment disconnects the graph into separate components; each component is evaluated independently, as in variable elimination, and the component values are multiplied, while the results for the different values of the conditioned variable are summed. The full algorithm is more involved than this, because it uses a cache that stores already computed values, trading memory for speed.
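
The sketch below illustrates the idea only: it reuses the illustrative (variables, table) factor representation from earlier on this page, assumes boolean domains, and omits the caching of repeated sub-computations mentioned above. The splitting helper and function names are assumptions for this page, not Darwiche's implementation.

def components(factors, assigned):
    """Split factors into groups that share no unassigned variable."""
    remaining = list(factors)
    groups = []
    while remaining:
        group, frontier = [], [remaining.pop()]
        while frontier:
            f = frontier.pop()
            group.append(f)
            f_vars = set(f[0]) - assigned
            linked = [g for g in remaining if set(g[0]) & f_vars]
            for g in linked:
                remaining.remove(g)
            frontier.extend(linked)
        groups.append(group)
    return groups

def rc(factors, assignment):
    """Sum of the product of all factors over the unassigned variables,
    given a partial assignment (the evidence / conditioning context)."""
    if not factors:
        return 1.0
    assigned = set(assignment)
    unassigned = {v for f in factors for v in f[0]} - assigned
    # Base case: every variable is assigned, so just multiply table entries
    # (rows missing from a restricted factor are treated as zero).
    if not unassigned:
        value = 1.0
        for variables, table in factors:
            value *= table.get(tuple(assignment[v] for v in variables), 0.0)
        return value
    # Decompose: if conditioning has disconnected the graph, solve each
    # component independently and multiply the results.
    parts = components(factors, assigned)
    if len(parts) > 1:
        value = 1.0
        for part in parts:
            value *= rc(part, assignment)
        return value
    # Condition: branch on one unassigned variable and sum over its values.
    var = sorted(unassigned)[0]
    return sum(rc(factors, {**assignment, var: val}) for val in (True, False))

# Example use: with the network's CPTs as factors, rc(cpts, {"Smoke": True})
# returns the probability of the evidence Smoke = true.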

Time Complexity

Comparing the two algorithms, recursive conditioning can take more or less time than variable elimination, depending on the network's properties and structure (its time complexity can range from linear to exponential). However, recursive conditioning can take less space than variable elimination, since it can choose which computed values to cache and which to forget and recompute. Recursive conditioning thus offers a smooth tradeoff between computation time and memory use[6].

Applications and Limitations

The main application of recursive conditioning in graphical models is to find the posterior probability distribution of non-assigned variables given a context and a set of factors. It is helpful in situations where memory is an issue, as it can be used to trade off speed against memory. In some network configurations, recursive conditioning offers linear time complexity, which is better than variable elimination. However, as with variable elimination, recursive conditioning can have exponential computational complexity and can even perform worse than variable elimination.

Keeping in mind that recursive conditioning is a complexity-reduction algorithm (as stated in the applications section for variable elimination), it can be used on larger graphical models provided they satisfy a few conditions (these conditions won't be stated here, as recursive conditioning was not explained in full detail). In the paper by Darwiche, recursive conditioning was applied to genetic linkage analysis networks that are large (containing around 9000 variables) and would require over 20 GB of memory using variable elimination; recursive conditioning was able to compute the probability of some evidence with respect to the network using 112 MB of memory.

Annotated Bibliography


