Course:CPSC522/Adverserial Belief Propagation

Jump to: navigation, search

Title

This page models belief propagation in UGMs with complex factors which are adversarially learned from data.

Principal Author: Hooman Hashemi
Secondary Author:

Abstract

Discriminators in GANs define distributions. In the first part of this article a concrete method is provided to explore this distribution as a single monolithic factor. In the second part, ways to combine the factors into more complex undirected graphical models is mentioned. While discriminator defines a complex factor table, generator helps to explore this factor as a proposal distribution. In this model, as the factor is explored proposal distribution becomes more similar to the Discriminator distribution. It is intended to use the latent state of the generator in converged state as messages between factors.

Builds on

Adversarial BP is built on GANs. And it has an undirected structure similar to UGMs.

Content

Background

In undirected graphical models, exact local marginals (or beliefs) can be computed by eliminating (or marginalising) variables one by one. This operation forms a tree like structure of passed marginals here called messages and the operation is called belief propagation. The problem with exact inference is the increasing size of messages and cliques, after eliminating a variable all adjacent variables become dependent forming large cliques. Calculating messages is done by marginalising and iterating all possible assignments to these cliques. So the time complexity is exponentially growing with the clique sizes.

Loopy Belief Propagationn

Loopy BP uses the same updates, but in a graph with cycles instead of a clique-tree. It is no longer interpreted as marginalisation but instead an attempt to minimize an approximation to KL-Divergence of the distribution defined by messages and the distribution defined by the factors. Clique sizes can be smaller.

Single Factor

We first explain how to use a discriminator as a complex factor table, in a single factor setting via an adversarial search.

Adverserial Search

Intuition
1. Using discriminator feed-backs, converge the proposal distribution to discriminator's distribution (an alternative to finding best fitting marginals).
2. In this implementation, runtime is similar to training of a GAN. In each iteration generator (proposal) is updated based on feedback of discriminator. The discriminator adopts to the new state without training(by just conditioning on the state).
Objective Function

(${\displaystyle \theta ^{t}}$ is the runtime state not the parameters of network)

${\displaystyle -Loss(D)=\mathop {\mathbb {E} } _{I\sim p_{data(I)}}\mathop {\mathbb {E} } _{t,\theta ^{t}\sim p_{g}(\theta ^{t}|I,D)}(\mathop {\mathbb {E} } _{x\sim p_{data}(x|I)}[logD(x|\theta ^{t},I)]+\mathop {\mathbb {E} } _{x\sim p_{g}(x|\theta ^{t},I)}[log(1-D(x|\theta ^{t},I))])}$
So the optimal discriminator ${\displaystyle D^{*}(x|\theta ^{t},I)}$, defines the ratio: ${\displaystyle D^{*}(x|\theta ^{t},I)={\frac {p_{data}(x|I)}{p_{data}(x|I)+p_{g}(x|I,\theta ^{t})}}}$
(${\displaystyle \alpha log\beta +(1-\alpha )log(1-\beta )}$ is maximized when ${\displaystyle \alpha =\beta }$, here for a given ${\displaystyle x}$, apply this for ${\displaystyle \alpha ={\frac {p_{data}(x|I)}{p_{data}(x|I)+p_{g}(x|I,\theta ^{t})}}}$ and ${\displaystyle \beta =D^{*}(x)}$ )
${\displaystyle Loss(G)=\mathop {\mathbb {E} } _{I\sim p_{data(I)}}\mathop {\mathbb {E} } _{t,\theta ^{t}\sim p_{g}(\theta ^{t}|I,D)}(\mathop {\mathbb {E} } _{x\sim p_{data}(x|I)}[logD(x|\theta ^{t},I)]+\mathop {\mathbb {E} } _{x\sim p_{g}(x|\theta ,I)}[log(1-D(x|\theta ^{t},I))])}$
Putting ${\displaystyle D^{*}}$ into loss of ${\displaystyle G}$:
${\displaystyle =\mathop {\mathbb {E} } _{I\sim p_{data(I)}}\mathop {\mathbb {E} } _{t,\theta ^{t}\sim p_{g}(\theta ^{t}|I,D)}(\mathop {\mathbb {E} } _{x\sim p_{data}(x|I)}[log{\frac {p_{data}(x|I)}{p_{data}(x|I)+p_{g}(x|I,\theta ^{t})}}]+\mathop {\mathbb {E} } _{x\sim p_{g}(x|\theta ^{t},I)}[log({\frac {p_{g}(x|\theta ^{t},I)}{p_{data}(x|I)+p_{g}(x|\theta ^{t},I)}})])}$
${\displaystyle =\mathop {\mathbb {E} } _{I\sim p_{data(I)}}\mathop {\mathbb {E} } _{t,\theta ^{t}\sim p_{g}(\theta ^{t}|I,D)}JSD(p_{data}(x|I)||p_{g}(x|\theta ^{t},I))}$

After adding ${\displaystyle D}$ to the ${\displaystyle p_{g}(\theta ^{t}|I,D)}$, the model seemed unstable at first, but might not always be. (see stability)

Simple Implementation

${\displaystyle \varphi }$: encoding of ${\displaystyle I}$.
G-lstm: ${\displaystyle [\varphi ,\nabla _{\theta ^{t-1}}Loss(D)(x^{t-1}),\theta ^{t-1}]\rightarrow \theta ^{t}}$.
G2: ${\displaystyle [\theta ^{t},z]\rightarrow x^{t}}$
D: ${\displaystyle [x^{t},\theta ^{t}]\rightarrow }$ ratio

Instead of ${\displaystyle \nabla _{\theta }}$ the ratio or loss themselves can be given, based on applicatoin.

Attention and associative memory can be added, It will help to avoid local minimas

Stability, convergence and updates

${\displaystyle -L_{D}(D,G,\theta _{\theta })=L_{G}(G,D,\theta _{\theta })=\mathop {\mathbb {E} } _{I\sim p_{data(I)}}\mathop {\mathbb {E} } _{t,\theta ^{t}\sim p_{g}(\theta ^{t}|I,D)}-L_{D}(D,G,\theta ^{t},I)}$
${\displaystyle =\mathop {\mathbb {E} } _{I\sim p_{data(I)}}\mathop {\mathbb {E} } _{t,\theta ^{t}\sim p_{g}(\theta ^{t}|I,D)}(\mathop {\mathbb {E} } _{x\sim p_{data}(x|I)}[logD(x|\theta ^{t},I)]+\mathop {\mathbb {E} } _{x\sim p_{g}(x|\theta ^{t},I)}[log(1-D(x|\theta ^{t},I))]).}$
Where as stated,
${\displaystyle L_{G}(G,D,\theta ^{t},I)=-L_{D}(D,G,\theta ^{t},I)=(\mathop {\mathbb {E} } _{x\sim p_{data}(x|I)}[logD(x|\theta ^{t},I)]+\mathop {\mathbb {E} } _{x\sim p_{g}(x|\theta ^{t},I)}[log(1-D(x|\theta ^{t},I))]).}$
Here I use the format ${\displaystyle L_{D}(D|G,\theta _{\theta })}$ to represent the same functions. Define,
${\displaystyle \theta _{\theta }^{*}(D|G)(=\theta _{\theta }^{*}(D,G))\sim \mathop {argmax} _{\theta _{\theta }}L_{D}(D|\theta _{\theta },G),}$
assuming ${\displaystyle p_{g}(\theta ^{t}|\theta _{\theta },D,I)}$ can learn to produce a subset of maximum values of ${\displaystyle L_{D}(D|D,\theta ^{t},I)}$ with a high probability. And define,

${\displaystyle D^{*}(G)\sim \mathop {argmax} _{\theta _{d}}L_{G}(G|\theta _{\theta }^{*}(D|G),D),}$
again assuming ${\displaystyle D}$ can learn the probability ratios of ${\displaystyle p_{data}}$ and ${\displaystyle p_{g}}$. Then the following update steps should minimize the objective function of the generator.

1. First for a given ${\displaystyle D}$, optimize ${\displaystyle \theta _{\theta }}$ toward ${\displaystyle \theta _{\theta }^{*}(D|G)}$, by repeatedly sampling a sequence of ${\displaystyle \theta ^{t}}$ from ${\displaystyle p_{g}(\theta ^{t}|D,I)}$ using the recurrent model defined by ${\displaystyle \theta _{\theta }}$ and the information from iteration ${\displaystyle t-1}$, then sampling ${\displaystyle x^{t}}$ from ${\displaystyle p_{g}(x^{t}|\theta ^{t},I)}$ by using the model with the parameters ${\displaystyle \theta _{G}}$, calculating the loss and back-propagating. Repeat enough, by assumption ${\displaystyle \theta _{\theta }^{*}}$ should produce a subset of the argmax of ${\displaystyle L_{D}}$ with a high probability.
2. Take a small step from ${\displaystyle D}$ toward ${\displaystyle D^{*}(G)}$ in a similar manner, ${\displaystyle D}$ should learn the ratio between ${\displaystyle p_{g}(x^{t}|I,\theta ^{t})}$ and the actual ${\displaystyle p_{data}(x^{t}|I(,\theta ^{t}))}$. Repeat ${\displaystyle i}$ and ${\displaystyle ii}$. Note that from ${\displaystyle G}$ only the recurrent part is updated so far. The minimizer for ${\displaystyle L_{D}(D|G,\theta _{\theta }^{*}(D|G))}$ should be close to ${\displaystyle D^{*}(x|\theta ^{t})\sim {\frac {p_{data}}{p_{data}+p_{g|\theta ^{t}}}}}$ for all high probably ${\displaystyle \theta ^{t}}$, and ${\displaystyle L_{D}(D|G,\theta ^{t},I)}$ should be close to the ${\displaystyle JSD}$ for all such ${\displaystyle \theta ^{t}}$.
3. Take a small step from ${\displaystyle G}$ toward minimizing the objective, this should minimize the expectation of loss over ${\displaystyle L_{G}(G|D^{*}(G),\theta _{\theta }^{*}(D^{*}|G))}$. Since for high probably ${\displaystyle \theta ^{t}}$ the loss is close to the ${\displaystyle JSD}$ we have,
${\displaystyle \mathop {\mathbb {E} } _{I\sim p_{data(I)}}\mathop {\mathbb {E} } _{\theta ^{t}\sim p_{g}(\theta ^{t}|I,D)}-L_{D}(D|G,\theta ^{t},I)\sim \mathop {\mathbb {E} } _{I\sim p_{data(I)}}\mathop {\mathbb {E} } _{\theta ^{t}\sim p_{g}(\theta ^{t}|I,D)}JSD(p_{data}(x|I)||p_{g}(x|I,\theta ^{t})).}$

I'm assuming minimizing the expectation over the argmax should minimize the maximum which is the JSD,repeat ${\displaystyle i}$, ${\displaystyle ii}$ and ${\displaystyle iii}$.
Figure

For a given Discriminator, recurrent part (${\displaystyle \theta _{\theta }}$) learns to produce converged states (${\displaystyle \theta ^{t}}$close to peaks), then discriminator is updated to be close to ${\displaystyle D^{*}}$or ${\displaystyle JSD}$. Finally ${\displaystyle G}$ is updated to minimise the expected ${\displaystyle JSD}$.

Stability and updates of Adversarial Search

Single Factor Example

Every planar map has a Four Coloring. It is NP-complete to decide whether a planar map has a Three Coloring. Given image of a map ${\displaystyle I}$, generator ${\displaystyle G}$ can produce random colouring's which is easily verifiable by discriminator ${\displaystyle D}$ if it is a three colouring. Exhaustive search is possible with adversarial search over latent features, here tile colors.

Many of NP-complete problems can be good examples for adversarial search. Another example is finding or highlighting a Hamiltonian cycle in a graph specified by image ${\displaystyle I}$.

Many of real world problems have underlying NP-complete problems.

Advantages (Single factor)

The ultimate goal of this research is to be able to perform complex chained integrations similar to message passing in a graphical model.

The following advantages are the ultimate goal of this research, although not yet achieved I believe that they are realisable via this approach.

1. Chained integration. Similar to finding a posterior in a hierarchical Bayes model but in an undirected manner. This is possible by only switching factors by conditioning on different questions. Models fitted in other questions or factors can be used as incoming messages to solve new questions or factors.
2. Unsupervised learning of diverse factors. Although it is possible to perform hierarchical Bayes inferences, It is not straight forward to see or learn what are the factors. In this model adding diversity to the factors may be as easy as putting attention on different subset of data.(though both these advantages are problems need to be solved).

Further research and Problems

The following are problems that need to be solved.

1. Factors. To design problems or factors unsupervisedly
2. Multiple Factors. To design a protocol to send and receive messages. For example after deciding what model/message is the given information of the question/factor and what model/message is the answer, one can sample from question model/message and answer model/message separately and optimise the answer model/message.
3. Speed up. This model is slow as it is. One possible speed up is to move the factors to higher layers of the network. (e.g. deciding fakeness from higher level features of generator as a subconscious shortcut[not a refined idea])

Multi Factor (Not refined)

It is not trivial how to reach multi factor belief propagation from the single factor algorithm, and needs more work. Here are a few suggestions or possible application.

1. Conditioning as factors. By conditioning the GAN on different conditions as factors and asking to find a state satisfying all condition. This can be more elaborately made to be similar to existing message passing algorithms. The state can also be distributed between factors via associative memory.
2. Sequential factors. Similar to FastSLAM, factors can be current Missing inputs, states in associative memory can act as map(distributed based on similarity in situation).
3. Discriminator decides the factors. To solve a problem it is possible that discriminator puts attention on weaknesses of generator. The problem is solved when discriminator cannot find any unsatisfied condition(or factor). Or this can be done via active learning principles.

Other scenarios like manually factorised problems are also possible, for example while performing sum of two 10 digit numbers factors are visually local. Sum of local discriminator can be used as the final discriminator. Though all of these are imperfect and need more work.

Annotated Bibliography

Put your annotated bibliography here. Add links where appropriate.

To Add

Put links and content here to be added. This does not need to be organized, and will not be graded as part of the page. If you find something that might be useful for a page, feel free to put it here.

 Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0