# Inductive Logic Programming

Author: Ritika Jain

Papers:

• Logic Programs as a basis for Machine Learning, Claude Sammut, 1988
• Learning to Parse Database Queries Using Inductive Logic Programming, John M. Zelle, Raymond J. Mooney, 1996

Inductive Logic Programming (ILP) is an area at the intersection of Machine Learning and Logic Programming.

## Abstract

Machine Learning encompasses broadly two approaches of learning: data driven and model driven. Data driven approach tires to generalize by finding common patterns in a sets of examples whereas Model driven approach uses a model built on concepts derived from the specific domain to form generalizations. ILP can be seen as an approach to induction which unifies these two different schools of learning. It looks for regularities among data using a search which is assisted by background knowledge of the concepts already learned.

## Introduction

The languages that can be used to accomplish the above should have a common property: the language should be able to grow with the system. To ensure that the limits of the language does not bound the limits of the system being described, the description language needs to be able to grow i.e. there should be flexibility to add new terms to the language.

There should not be a pre-defined and fixed "vocabulary", because when the limits of the language are reached, so too are the limits of the system as a whole. For example, LEX (Mitchell, Utgoff, Nudel and Banerji, 1981) used a fixed language to describe algebraic expressions. When it was found that LEX was unable to learn some concepts because the language could not describe the necessary expressions, a method for extending the language had to be found. [1]

fx(leaves,branches)

Definition of some terms which will be used throughout the article:

Concept
all those things which a program is intended to learn
Concept description language
method of representation
Primitives
a set of attributes/terms to begin with

We say that a language has the capability to grow when a newly learned concept is allowed to be used as a primitive.

fy(tree,flowers,birds)

Example: To be able to use a concept description in place of a primitive, it must be possible to interpret the description as a recognition procedure. If we learned a new concept 'tree' from the initial primitives 'leaves' and 'branches' then a good learning language will allow me to refer to 'tree' in the description of a garden scene and have the 'tree' program recognize all the trees in the scene.
${\displaystyle f_{x}(leaves,branches)\to tree}$
${\displaystyle f_{y}(tree,flowers,birds)\to garden}$

This allows for complicated representations without having to specify the primitives again. Such a language can be said to possess the capability to grow when the system grows.

### Inverting Resolution

Resolution provides an efficient means of deriving a solution to a problem, giving a set of axioms which define the task environment. Whereas resolution takes two terms and resolves them into a most general unifier, anti-unification finds the least general generalization of two terms.[1]
Example: The least generalization of

${\displaystyle p(g(a),a)}$ and ${\displaystyle p(g(b),b)}$
is
${\displaystyle p(g(X),X)}$

However, this method of generalization is based on the idea of subsumption and that has its limitations as shown in the example below: [2]
Example: Suppose we are given two instances of a concept called cuddly_pet,

${\displaystyle cuddly\_pet(X)\gets fluffy(X)\wedge dog(X)}$ (1)
${\displaystyle cuddly\_pet(X)\gets fluffy(X)\wedge cat(X)}$ (2)

Suppose we also know the following:

${\displaystyle pet(x)\gets dog(X)}$ (3)
${\displaystyle pet(x)\gets cat(X)}$ (4)

According to subsumption, the least general generalization of (1) and (2) would be,

${\displaystyle cuddly\_pet(X)\gets fluffy(X)}$ (5)

However, if we consider the background knowledge given by (3) and (4), we realize that this is overgeneralization and incorrect because it implies that any object (say a pen) could be considered a fluffy object. The appropriate would be therefore,

${\displaystyle cuddly\_pet(X)\gets fluffy(X)\wedge pet(X)}$

In light of this, Sammut in his paper, took the conservative view and said that generalisation should only be done when relevant background knowledge is available.

### Absorption

Given a set of clauses, the body of one of which is completely contained in the bodies of the others, such as:[1]

${\displaystyle X\gets A\wedge B\wedge C\wedge D\wedge E}$
${\displaystyle Y\gets A\wedge B\wedge C}$

we can hypothesize:

${\displaystyle X\gets Y\wedge D\wedge E}$
${\displaystyle Y\gets A\wedge B\wedge C}$

### Intra-construction

This is similar to the law of Boolean equations. We can take a group of rules all having the same head, such as: [1]

${\displaystyle X\gets B\wedge C\wedge D\wedge E}$
${\displaystyle X\gets A\wedge B\wedge D\wedge F}$

and replace them with:

${\displaystyle X\gets B\wedge D\wedge Z}$
${\displaystyle Z\gets C\wedge E}$
${\displaystyle Z\gets A\wedge F}$

Note: In intra-construction, a new term is automatically created to simplify descriptions. This is coherent with our expectation of growing language explained in the initial part of the introduction. At any time, during induction there may be a number of applicable operators and the second paper tries to learn which operators to apply when while parsing database queries.

# Background

## Deductive v/s Inductive Reasoning

Inductive reasoning is also known as hypothesis construction because any conclusions made are based on current knowledge and predictions. As with deductive arguments, biases can distort the proper application of inductive argument, thereby preventing the reasoner from forming the most logical conclusion based on the clues.Given that "if A is true then B, C, and D are true", an example of deduction would be "A is true therefore we can deduce that B, C, and D are true". An example of induction would be "B, C, and D are observed to be true therefore A may be true". A is a reasonable explanation for B, C, and D being true.[3] To put things in perspective, I have shown below how Inductive reasoning differs from the deductive reasoning.

Inductive v/s Deductive reasoning

## ILP: Objective

Given a dataset:

• Positive examples (E+) and optionally negative examples (E-)
• Set of constraints to make the learning process more efficient (C)

Goal of an ILP system is to find a set of hypothesis that:

• Covers (explains) the positive examples - Completeness
• Are consistent with the negative examples - Consistency

${\displaystyle h\in H:\forall _{p}\in P:covers(h,p)\wedge \forall _{n}\in N:\neg covers(h,n).}$[4]

## ILP Algorithm

The fundamental idea of the algorithm is to keep track of candidate hypotheses. It repeatedly expands the hypotheses using inference rules, the expanded hypotheses are then added to the queue of hypotheses which may be pruned to discard unpromising hypotheses from further consideration. This process continues until the stopping criterion is satisfied. Formally, the algorithm is as follows:[5][6]

 1  QH := INITIALIZE
2  repeat
3
4 DELETE H from QH
5 CHOOSE the inference rules ${\displaystyle r_{1},..,r_{k}\in R}$ to be applied to H
6  Apply the rules ${\displaystyle r_{1},..,r_{k}}$ to H to yield ${\displaystyle H_{1},H_{2},...,H_{n}}$
7  Add ${\displaystyle H_{1},..,H_{n}}$ to QH
8  PRUNE QH
9
10  until STOPPINGCRITERION (QH) is satisfied


The algorithm has the following parameters:

• INITIALIZE denotes the hypothesis started from.
• R denotes the set of inference rules applied
• DELETE influences the search strategy.
• CHOOSE determines the inference rules to be applied on the hypothesis H.
• PRUNE determines which candidate hypotheses are to be deleted from the queue.
• The Stop-criterion states the conditions under which the algorithm stops. Some frequently employed criteria require that a solution be found, or that it is unlikely that an adequate hypothesis can be obtained from the current queue.

# Content

## Problem definition

The objective is to automate the construction of a natural-language interface for database queries using CHILL parser acquisition system. CHILL (Constructive Heuristics Induction for Language Learning) is a general approach to the problem of inducing natural language parsers. For more information on CHILL, please refer to the section below. Starting with a general framework for constructing a suitable logical form, CHILL is able to train on a corpus comprising sentences paired with database queries and induce parsers that map subsequent sentences directly into executable queries.[7] The corpus-based or empirical approach replaces hand-generated rules with models obtained automatically by training over language corpora. Essentially, the authors aim to use CHILL to engineer a natural language front-end for a database-query task. The database that they chose is called Geobase which is a United States geography database. The system produced is easily evaluable as all that needs to be checked is if it provides a correct answer for a given database query (question).

## Learning to Parse DB Queries

### Overview of CHILL

CHILL uses inductive logic programming to learn a deterministic shift-reduce parser written in Prolog. The input to CHILL is a corpus of sentences paired with semantic representations.The parser learned is capable of mapping the sentences into their correct representations, as well as generalizing well to novel sentences.[8] The input to CHILL here is a set of training instances consisting of sentences paired with the desired parses. The output is a shift-reduce parser that maps these sentences into parses. Control-rules are expressed as definite-clause concept definitions. CHILL uses parses of the training examples to figure out the contexts in which each of the inferred operators is and is not applicable. These contexts are then given to an ILP learning algorithm which produces relational concept descriptions. The figure below shows the basic components of CHILL.

The CHILL Architecture

During parsing operator generation, the training examples are given as an input to the CHILL which produces an overly general Parser(it is overly general because it produces many spurious parses from any given input sentence). During Example Analysis, the overly general parser is employed to produce contexts where and where not the parsing operators should be applied. An ILP algorithm learns rules that is able to characterize these contexts. To finish up, the learned control rules are fed back to the overly-general parser which gives the final parser.

### Parsing DB Queries

The query language considered here is a logical form. This was chosen rather than the widely used SQL because the logical form provides a more straightforward mapping from natural language utterances which is integral to the CHILL approach. The database chosen is a United States geography database system. This is chosen because of the availability of an already existing natural language interface called Geobase which we can evaluate against. The Geobase data contains about 800 Prolog facts about the basic information of US geographical topography like information about basic states, about their capitals, population, major cities, major rivers, adjoining states, highest and lowest points in elevation etc.

Some sample questions in natural language and their corresponding database queries are shown below:[7]

 1. What is the capital of the state with the largest population?
2. What are the major cities in Kansas?


### The Query Language, Geoquery

The basics of query representation are the terms used to represent the objects in the database and their relations amongst each other. The basic forms are shown below:[7]

Basic Objects in Geoquery

The objects are country, city, state, river and place. City is represented as a two argument term because there might be a city of the same name in two different states. Example: the city named 'Washington' is present in 31 states out of 50 in the US. To distinguish them we would need to include the name of the state as well. This method also accounts for partial information like it occurs in natural language all the time. In cases where a city is known only by its name, the second term is given an uninstantiated variable. The basic relations for objects is shown below:[7]

Basic Predicates in Geoquery

These are the basic predicates that provide most of the expressiveness of Geoquery. However, to form complete queries we need to incorporate meta-predicates in our system. Meta predicates are different from basic predicates such that they take completely-formed conjunctive goals as one of their arguments as shown below: [7]

Meta-Predicates in Geoquery

The most important meta-predicate is answer with two arguments. This predicate acts as a wrapper for query goals indicating the variable whose binding is of interest (i.e. answers the question posed).

## Experimental Results

### Experiments

A corpus of 250 sentences was gathered by submitting a questionnaire to 50 uninformed subjects. For evaluating, the corpus was divided into training set of 225 examples with 25 held for test cases. Testing criterion was whether the application produced the correct answer to a question or not. Each sentence was parsed to produce a query. This query was then executed to extract an answer from the database. This extracted answer was then compared to the answer produced by the correct query associated with the test sentence. Same answers were scored as a correct parsing, any other difference was taken as a failure. 10-fold cross validation was used to test the average accuracy of CHILL's parser. The figure below compares the average accuracy of the Geobase system when compared to CHILL's parser system.[7]

Geoquery : Accuracy

10 trials were run using different random splits of training and testing data. The performance of the existing system, called Geobase is shown by the dashed line which has been tested on the 25 sentences. CHILL has been trained on random splits of training data and the curve has been plotted for 10 runs. From the curve, we can see that CHILL outperforms the existing system when trained on 175 or more examples. It is important to distinguish the two different modes of failures here. The system could either fail to parse a sentence entirely or it could produce a query which gives an incorrect result. The percentage of spurious parses was 3.2% at 175 examples which dropped down to 2.3% at 200 examples.

The results show that:

1. CHILL has the ability to learn parsers that map sentences into queries without intermediate syntactic parsing or annotation. This holds significant importance because it is a step towards reducing the linguistic expertise needed for NLP applications. A natural extension of this can be seen as automating tasks which are currently done manually example collection examples of the queries produced by database users in the normal course of their work.

2. The results also demonstrate the utility of an empirical approach for a completely natural language application. While we do understand, Geobase is not the most state-of-the-art systems for NLP database query systems but uses a relatively robust parser where many words can simply be ignored and CHILL performing better after training on a relatively small number of examples is encouraging.

## Conclusions

The paper describes an ILP system to learn parsers that map natural language sentences into database queries after training on a corpus of sentences paired with queries. Experiments on U.S geography database show that CHILL's parsers outperform an existing hand-crafted counterpart. These results encourage ideas towards CHILL's ability to learn semantic mappings and highlights the importance of an empirical approach at the level of an NLP application.

Some avenues for future work are:

1. Experiments with much larger corpora and other domains can be tried out.

2. Investigating the increase in performance when using 'manufactured corpora' which allows the the introduction of related sentences hence increasing generality.

## Applications of ILP

Since the output from the ILP systems is quite easy to understand because it is in the form of logic programs, this makes it a good learning approach for scientific applications where is it often more important that the hypotheses discovered by the system are understood by the user. Some other real world tasks that ILP has been applied to are:[9]

Finite Element Mesh Design
Finite element methods are used by engineers to analyze stresses in physical structures. These methods depend on modelling the structures with finite meshes at particular resolutions. Considerable expertise is required to choose the correct resolution to avoid unnecessary computational overheads (too fine a mesh), but to reduce approximation errors (not fine enough). ILP systems were used to determine rules for the mesh resolution of edges in the structure in terms of certain properties of the structure being modelled. The Golem, LINUS, FOIL and CLAUDIEN systems were used and produced novel, understandable rules which achieved around 78% predictive accuracy.
Predictive Toxicology
Drug companies lose millions of pounds by developing drugs which eventually turn out to be toxic to humans. In one of a number of experiments with example drugs split into toxis and non-toxic sets, Progol was used to determine why certain chemicals were mutagenic and others were not. It produced rules which achieved 89% predictive accuracy over a set of 188 compounds.
Generating Program Invariants
Programs written in certain languages can be formally proved to be correct (i.e, perform as per their specifications). To do so, it is imperative to find suitable conditions that always hold at given points in the program. One method is to find suitable conditions within loops, and these conditions are called loop invariants. An ILP system was used to generate such loop invariants and was successful in doing so.

Some of the recent developments that have taken place on Database querying using CHILL acquisition system are as follows:

1. CMU larger corpus on Database Query application

This compares the performance of Geobase, CHILL and WOLFIE on a larger corpus with 471 training sentences (with 221 new sentences).

2. The CHILL software can be accessed from here: CHILL software
3. The Geobase: database of Geography facts can be accessed from here: Geobase facts
4. The subset of geoqueries (250 sentences) as used by this paper can be accessed from here: Geoqueries - 250 sentences
5. Instructions for running Prolog queries for the Geoquery data (system that executes a given query in a logical form) can be accessed from here: Instructions for running Geoquery