# Course:CPSC522/Problog

# ProbLog

**Principal Author**: Junyuan Zheng

**Second Author**: Samprity Kashyap

# Abstract

In this page, I will introduce ProbLog, a probabilistic extension of Prolog. ProbLog can be regarded as a database system that supports both probabilistic and inductive reasoning through a variety of querying mechanisms; it has several advantage over traditional logic programming, which is achieved by introduction of an effective solver for computing the success probabilities. It essentially combines SLD-resolution with methods for computing the probability of Boolean formulae, it also employs an approximation algorithm that combines iterative deepening with binary decision diagrams.

I will first introduce some basic knowledge and theory to help the reader understand. And then introduce what is ProbLog, its basic syntax and examples, and the most important function of the ProbLog-computing the success probabilities.The final part will be comparing the ProbLog with other logic programming framework, which will illustrate the reason that ProbLog surpass the probabilistic Datalog (pD).

# Content

## Background Knowledge

### Boolean Algebra

In mathematical logic, Boolean algebra is the branch of algebra in which the values of the variables are truth value or false value, usually represented by 1 or 0 respectively. The main operations of Boolean algebra are the conjunction 'and', denoted , the disjunction 'or', denoted , and the negation 'not', denoted . It is thus a formalism for describing logical relations in the same way that ordinary algebra describes numeric relations^{[1]}.

Here are some basic laws that we will used later^{[1]}:

### Disjunctive Normal Form (DNF)

You can find more about the Disjunctive Normal Form here

A **clause** that contains only is called a **disjunctive clause**, and a clause that contains only is called a **conjunctive clause**^{[2]}. For instance:

- is a disjunctive clause
- is a conjunctive clause

If we put a bunch of disjunctive clauses together with , it is called **conjunctive normal form**.

- in conjunctive normal form.

Similarly, putting conjunctive clauses together with , it is called **disjunctive normal form**.

- is in disjunctive normal form.

### Binary Decision Diagram (BDD)

You can find more about the Binary Decision Diagram on the Wikipedia

A binary decision diagram (BDD) is a data structure that is used to represent a Boolean function, which contains several decision nodes and 2 terminal nodes, 0-terminal and 1-terminal.

Each decision node has two child nodes called low child and high child. The edge from node to a low child represents an assignment of node to , and the edge from node to a high child represents an assignment of node to .^{[3]}.

A path from the root node to the 1-terminal represents a variable assignment for which the represented Boolean function is true.

### Selective Linear Definite clause Resolution (SLD Resolution)

This document is very helpful for you to understand the SLD Resolution.

SLD resolution is the primary and the most important rule that used in logic programming.

Before we introduce the ProbLog, it is necessary to have some basic idea about logic programming. In logic program, there are four principal components, they are **probabilistic facts**, **rules**, **background information**, and **query**.

For instance^{[4]},

1 % Rules 2 grandfather(X,Z) :- father(X,Y), parent(Y,Z). % To prove grandfather(X,Z), prove father(X,Y) and then prove parent(Y,Z) 3 parent(X,Y) :- father(X,Y). % If father(X,Y) is true, then parent(X,Y) is true 4 parent(X,Y) :- mother(X,Y). % If mother(X,Y) is true, then parent(X,Y) is true 5 6 % Background Information 7 father(a,b). % 'a' and 'b' are father and son/daughter or son/daughter and father 8 mother(b,c). % 'b' and 'c' are mother and son/daughter or son/daughter and mother 9 10 % Query 11 query(grandfather(a,X)) % Try to find the X, who is a's grandparents

This looks like a complicated problem (actually, it is not), however, by using SLD-resolution, we can solve the problem quickly.

Our goal is to find the a's grandfather (or X's grandfather):

However, based on the first 'Rule': `grandfather(X,Z) :- father(X,Y), parent(Y,Z).`

. We only need to find:

as the 'rule' imply: to prove , prove and then prove . Therefore, our derived goal is to prove: `father(C,E), parent(E,D).`

Let’s first make sure , based on the information: and one of the ‘background information’ , if , then .

Now, we know, , next we will try to find the to make the true. But now has two possible solutions based on the fact:

parent(X,Y) :- father(X,Y). parent(X,Y) :- mother(X,Y).

We need to divide the into two branches:

As we can see from the picture. One of the branches is contradict with the background information `father(a,b).`

, therefore, it's a failure branch, we have to delete that.

As for the , we can find a value `mother(b,c)`

from the background information for , to make sure is , and the .

Here is conclusion for this example, the is the father of , and is the mother of , therefore, is the grandfather of .

If we represent this procedure in an SLD-tree, which will be utilized in the next section, it looks like this:

## ProbLog

ProbLog is a recent probabilistic extension of Prolog, which can be regarded as a database system that supports both probabilistic and inductive reasoning through a variety of **querying mechanisms**^{[5]}.

To help the reader understand the ProbLog, it's necessary to introduce some basic knowledge.

### Why we need to use ProbLog

A motivation application for ProbLog is in biology, in which enormous amounts of molecular biological data are available. They contain information about various types of objects, such as genes, proteins, tissues, organisms, biological processes, and molecular functions, sometimes we also know their relationships. For instance, gene A of organism B can be coded for protein C, which exists in tissue D, by the biochemical process.

Mining this data has been identified as an important and challenging task. Such a collection of interlinked heterogeneous biological data can be conveniently seen as a weighted graph or network of biological concepts, where the weight of an edge (an edge is used to connect two nodes together) corresponds to the probability that the corresponding nodes are related^{[5]}.

A query that scientist wants to ask from this database is whether a given gene is connected to a given disease. By using ProbLog, such query can be easily expressed in logic by asking whether there is a path between a gene and a disease, for instance, "whether there is a path between 'gene 620' and 'disease Alzheimer'". Since edges are expressed by using probability, the query has a certain success probability. Moreover, this probability corresponds to the likelihood that a path exists between those two nodes, sees 'gene 620' and 'disease Alzheimer'^{[5]}.

### Basic Syntax and Example

The syntax and semantics of the **ProbLog** programming language are the set of rules that defines how to write and interpret ProbLog program.

#### Facts and Rules

Like any ProLog program, it consists of two parts: a set of **probabilistic facts** and a set of **rules** (deterministic rules, as in regular Prolog).

A rule is of the form:

Head :- Body.

and is read as "Head is true if Body is true".

Clauses with empty bodies are called **facts**. An example of a **fact** is:

0.5::coin.

which is equivalent to: coin lands on heads with probability 0.5

#### Use ProbLog to "Toss" Coins

In this case, we need one probabilistic fact for each coin being tossed. For instance^{[6]}:

1 % Probabilistic facts: 2 0.5::heads1. % indicates that the first coin lands on heads with probability 0.5 3 0.6::heads2. % indicates that the second coin lands on heads with probability 0.6 5 4 % Rules: 5 twoHeads :- heads1, heads2. % both coins to land on heads |

If we want to know the probability of first coin lands on head , second coin lands on head , and both coins landing on heads . We only need to run the **query**.

% Queries: query(heads1). query(heads2). query(twoHeads).

We can get result: , and . The probability of both coins landing on heads is . You can try the example here

If we are now asking for the posterior probabilities of the query given the evidence that the **both coins landing on heads** is false. We only need to add an **Evidence**

% Evidence evidence(twoHeads, false).

After we rerun the code, we can get result: , and . And of course The probability of both coins landing on heads is

#### More General Condition

Let’s consider a more general condition. Suppose we toss N coins (which are all biased with probability 0.6 for heads) and we are interested in the probability of at least one of them landing on heads. We can model this in ProbLog by using the code:

1 % Probabilistic facts: 2 0.6::heads(C) :- coin(C) 3 4 % Background information: 5 coin(c1). 6 coin(c2). 7 coin(c3). 8 coin(c4). 9 10 % Rules: 11 someHeads :- heads(_). 12 13 % Queries: 14 query(someHeads).

Note that

0.6::heads(C) :- coin(C).

means for each instantiation , there is a corresponding probabilistic fact . If we assume that we have N coins, the probability of at least one coin landing on heads is . In this ProbLog program, the . If we run the code, the .

#### Usage in Probabilistic Graphs^{[7]}:

The most important usage of ProbLog is in Probabilistic Graphs, as we can see in the picture.

For instance, the node No.1 is ‘gene 620’ and node No.6 is 'disease Alzheimer', if we want to know the ‘probability of a direct connection’ between the gene and the disease, we only need to, as introduced before, use ‘query(path(gene 620, disease Alzheimer)).’, which will return a probability of the path exist.

In ProbLog, we can achieve this by using the code:

1 0.6::edge(1,2). 2 0.1::edge(1,3). 3 0.4::edge(2,5). 4 0.3::edge(2,6). 5 0.3::edge(3,4). 6 0.8::edge(4,5). 7 0.2::edge(5,6). 8 9 path(X,Y) :- edge(X,Y). 10 path(X,Y) :- edge(X,Z),path(Z,Y). 11 12 query(path(1,6)).

The program illustrates the graph with six nodes and seven probabilistic edges. The ProbLog program uses the regular Prolog definition of path. If we run this ProbLog program, we can get .

### Probabilistic Inference

In this section, I will introduce the core algorithms and techniques to calculate the success probabilities and most likely explanations of queries. However, first, I need to present an example, which will be used in the later discussion.

% Example 1 1.0:: likes(X,Y):- friendof(X,Y). % If friendof(X,Y) is true, then likes(X,Y) is true. 2 0.8:: likes(X,Y):- friendof(X,Z), likes(Z,Y). % If friendof(X,Z) and likes(Z,Y) is true, then likes(X,Y) has 80% to be true. 3 0.5:: friendof(john,mary). % the probability of fridendof(john,mary) is true is 50% 4 0.5:: friendof(mary,pedro). % the probability of fridendof(mary,pedro) is true is 50% 5 0.5:: friendof(mary,tom). % the probability of fridendof(mary,tom) is true is 50% 6 0.5:: friendof(pedro,tom). % the probability of fridendof(pedro,tom) is true is 50% 7 8 % Query 9 query(likes(john,tom)). % We wants to know whether john likes tom.

In traditional logic programming, we can only know whether a query succeeds or fails. However, in ProbLog, as introduced before, we are can compute the probability that it succeeds.

Next, I will introduce two methods used by ProbLog to calculate the probability, and I will use the example, presented above, to help the reader understand the algorithms and concepts.

#### Exact Inference

As for the **Exact Inference**, the ProbLog uses a method involving two steps. The first is concerned with the convert the **query** into a DNF formula (Disjunctive Normal Form, introduced before). The second component computes the probability of this DNF formula.

##### ProbLog queries as DNF formulae

To present ProbLog queries as DNF formulae, the ProbLog employs SLD-resolution (Selective Linear Definite clause Resolution, introduced before), from which the execution mechanism of Prolog is derived. For this example, the SLD-tree for the query(likes(john,tom)). is depicted below.

The paths from the root to individual leaves of the SLD-tree represent either a successful or a failed proof, as you can see in the picture, if the test is failed, we will discard the result.

The original paper explained this method very well but hard to understand. Therefore, I will try to use my word to explain how to convert the SLD-tree into a DNF formulae. First, let's delete all the unnecessary part, leaves represent failed proof, of the SLD-tree.

There are two branches in this picture that can result successful proof, te first one is through `likes2-friendof1-likes1-friendof3`

, the second one is through `likes-friendof1-likes2-friendof2-likes1-friendof4`

. If we want to use mathematic method to represent the query has a successful proof, we can use equation like this:

By using **Boolean Algebra**, we can get:

This equation is a **DNF formula**, for we already know , therefore, we can further simplify this equation:

And here is the final **DNF formula**, which will be used in the second component: **Computing the probability of DNF formulae**.

##### Computing the probability of DNF formulae

Computing the probability of DNF formulae is an NP-hard problem even if all variables are independent^{[5]}. There are several algorithms for transforming a disjunction of conjunctions into mutually disjoint conjunctions, for which the probability is obtained just as a sum. One basic approach relies on the inclusion-exclusion principle
from set theory. It requires the computation of conjunctive probabilities of all sets of conjunctions appearing in the DNF formula. This is clearly intractable in general. There are also some advanced methods. However, these algorithms seem to be limited to a few dozens of variables and a few hundreds of sums.

In ProbLog, the designer utilized the advances made in the manipulation and representation of Boolean formulae using binary decision diagrams (BDD, which is introduced before).

A BDD is an efficient graphical representation of a Boolean function over a set of variables. A BDD representing the formula:

is in the figure, shown below:

Given a BDD, it would be much more easy to compute the probability of the corresponding Boolean function by traversing the BDD from the root node to a leaf. At each node, probabilities from both children node (low children node and high children node) are calculated recursively and combined. And here is the pseudo code^{[5]}:

% Probability(input: BDD node n ) 1 if is the 1-terminal then return 1 2 if is the 0-terminal then return 0 3 let and be the high and low children of n 4 Probability() 5 Probability() 6 return

I implemented this BDD and algorithm in a Java program, you can find the file here, the result is , and here is part of the code:

Node terminal0 = new Node(null, null, 0); Node terminal1 = new Node(null, null, 0); Node f3 = new Node(terminal0, terminal1, 0.500000); Node f4 = new Node(terminal1, f3, 0.500000); Node f2 = new Node(f4, f3, 0.500000); Node f1 = new Node(f2, terminal0, 0.500000); Node l2 = new Node(f1, terminal0, 0.800000); public double Probability(Node node) { if (node==terminal1) return 1.0; if (node==terminal0) return 0.0; Node h = node.h; Node l = node.l; double probH = Probability(h); double probL = Probability(l); return node.TProbability*probH + (1.0-node.TProbability)*probL; }

or you can use the original ProbLog program to test the output result. This algorithm can be applied to ProbLog programs containing hundreds of clauses and tens of thousands of proofs.

#### Approximation Inference

The former method, Exact Inference, is only useful in small ProbLog program, the reason is: for big ProbLog program, the size of the monotone DNF formula can explode, therefore, it will be time-consuming to fully calculate the DNF formula. In this section, I will introduce an algorithm that used in ProbLog to approximating the success probability of queries.

##### Three observations we get from 'Exact Inference'

###### First Observation

First, in the **query(likes(john,tom))** example, after we get a NDF formula:

We remove from the equation directly, for clasuses has a probability 1, thereby simplify the DNF formula.

Therefore, the first obvious observation leading towards a faster algorithm is: **one can remove all Boolean variables that correspond to clauses with probability 1 ^{[5]}**.

###### Second Observation

Second and more important observation allows us to **eliminate complete proofs from the DNF formula**. For instance, considering a DNF formula that we need to calculate:

During the computation of the SLD-tree, which proceeds depth-first, from left to right in Prolog, the program also keeps track of a DNF formula that represents the DNF of the already computed proofs:

If can be represented by using , then we can remove without any further calculate consideration, for we already know the and separetly, the is simply .

Because the program using monotone DNF formulae, this condition can be checked efficiently by verifying that the need to calculate proof is not entailed by any of the conjuncts in ^{[5]}.

This observation motivates the use of iterative deepening instead of depth-first search to compute the SLD-tree and the corresponding DNF formula (you can find the difference here). In this way, it becomes more likely that later proofs will be subsumed by already computed ones. Iterative deepening essentially proceeds as depthfirst search but does not expand goals in the SLD-tree whose depth exceeds a threshold. It then iteratively increases this depth-bound.

###### Third Observation

The final observation, it's the crucial one that leads to approximation of success probabilities, that is an imcomplete SLD-tree (during iterative deepening) can be used to derive an upper and a lower bound on the success probability of the query. This observation and the corresponding algorithm are related to work by David Poole in his Logic programming, abduction and probability. For ProbLog, the designer used two DNF formula from the incomplete SLD-tree to find the lower and upper bound. The first DNF formula encodes the successful proofs already occurring in the tree. The second DNF formula encodes the successful proofs already occurring in the tree as well as the proofs that have been cut off^{[5]}.

I will use the **query(likes(john,tom))** example to explain this idea. Consider the SLD-tree in the example only have depth 4. In this case, encodes the success path while additionally encodes the paths up to likes(pedro,tom) and likes(tom,tom), which will be reject in the former cases.

We can get:

Therefore, we can get:

##### Algorithm

Based on the three observations that introduced before, and to guarantee better approximations, the designers of the ProbLog bound the depth of SLD-trees with the approximations. Here is the pseudo code:

Approximate(query q; program T; bound ) 1 depthbound := 1; 2 d1 := false 3 repeat 4 d2 := d1 5 call Iterate(q, true, depthbound, d1, d2) 6 p1 := call Probability(d1) 7 p2 := call Probability(d2) 8 increment depthbound 9 until 10 return p1, p2

This Approximate function defined an **error**, if the difference between the higher bound and the lower bound is greater than the error value , then it will increase the depth until the requirement is satisfied.

## Comparing ProbLog with other framework

As the designers said, the ProbLog framework is not something entirely new, it has close connections with several other frameworks. For instance, **PHA** (designed by David Poole), **PRISM** (designed by T. Sato and Y. Kameya), **SLPs** (designed by S. Muggleton), and **probabilistic Datalog**, which is also called **pD** (developed by Fuhr).

The most intimate example related to the ProbLog is the probabilistic Datalog; just like ProbLog, it does not impose various constraints on probabilities, however, because its inference engine's limitation^{[5]}, **which I will introduce in this part**, it cannot be used to solve complex problems. Since pD employs a naive approach based on inclusion-exclusion for computing the probabilities of these formulae, the evaluation of about **10** or more conjuncts is infeasible in the implementation of pD according to the paper. In contrast, ProbLog’s approximation algorithm can deal with formulae containing up to **100000** formulas.

The pD also use two steps to calculate the probability, the first step is to transform the event expression into disjunctive normal form (DNF), this is as same as the ProbLog, however, for the second steps, it calculates the DNF formula in a theoretic way^{[8]}. And here is an example, given a DNF formula:

Based on the Boolean Algebra, we can use this equation to calculate the DNF formula:

In case there are more than two sets, this strategy is applied recursively, e.g.

And here is a general equation, given a DNF formula^{[8]}:

As you can imagine, with the increase the number of variables, the equation will become more and more complicated, and this is the main reason that constrained its real-life applications. And ProbLog solved this limitation by:

- transferring the DNF formula into the BDD tree and then perform the calculation.
- using approximation algorithm.

# Annotated Bibliography

- ↑
^{1.0}^{1.1}Wikipedia: Boolean Algebra - ↑ Greg Baker, Normal Forms
- ↑ Binary Decision Diagram
- ↑ Coen De Roover, Logic Programming
- ↑
^{5.0}^{5.1}^{5.2}^{5.3}^{5.4}^{5.5}^{5.6}^{5.7}^{5.8}Luc De Raedt, Angelika Kimmig and Hannu Toivonen, ProbLog: A Probabilistic Prolog and its Application in Link Discovery, 2007 - ↑ ProbLog Tutorial: Tossing Coins
- ↑ ProbLog Tutorial: Probabilistic Graphs
- ↑
^{8.0}^{8.1}Norbert Fuhr, Probabilistic Datalog: Implementing Logical Information Retrieval for Advanced Applications, 1999