# Course:CPSC522/Ontology Extraction


## Title

This entry describes and evaluates the incremental progress claimed by Gaeta et al. in research on ontology extraction.

Author: Shunsuke Ishige

## Abstract

Ontology extraction is an attempt to learn the concepts, and their interconnections, essential to a given domain by analyzing information sources such as documents in diverse formats. The resulting system of concepts, in the form of a set of terms and their definitions/meanings, can be utilized in knowledge representation and sharing. Ontology extraction might resolve problems with manual ontology construction ascribable to the amount and complexity of the work involved. This entry primarily concerns a paper by Gaeta et al. that aims to improve on previous similar work by Maedche and Staab by proposing an ontology extraction system designed to be eventually automatic.

### Builds on

The primary paper discussed in this entry is Gaeta et al. [1], which builds on Maedche and Staab [2]. Please see the course Wiki entry on Ontology for an explanation of ontology per se. Both papers concern the overall frameworks of comprehensive ontology extraction systems. Consequently, although ontology learning in this context is based on natural language texts, detailed knowledge of language parsing is not necessary; skimming the first few chapters of this slide set might help in gaining a high-level understanding. Some knowledge of WordNet, a lexical database commonly used in parsing natural language text, is assumed. Here is the link to the official website. To grasp roughly how WordNet helps in processing text information, browsing this slide set should help.

### Related Pages

To be added, if any found.

## Content

### Preliminaries

This section provides a very brief characterization of ontology, and introduces the specific research topic the authors of the two papers are concerned with.

#### Ontology in Information Technologies

Ontology is a set of terms and their definitions that systematically specify things and their relations for representing and sharing knowledge [1]. As reflected in the wide range of ontology formats, including database schemas and formal languages [2], uses of such sets are far-reaching. This reach, and the fundamental role representation plays in information storage, sharing, and reasoning, suggest the potential impact of the choice of terms and their definitions. A poorly coordinated system of terms and definitions can hinder agents and cause confusion; a well-chosen one might realize "high fidelity semantic communication" for agents using information in an efficacious manner [2].

#### Levels of Representation: Domain Ontologies

The desired level of abstraction for representation might vary, and so might the terms and definitions used in the representation. For example, in the domain of agriculture, we might want to represent, for information storage or reasoning in machines, basic domain-specific things such as various types of crops -- orange, carrot, and so on -- together with the general classes of products those crops belong to -- citrus, root vegetable, and so on. Alternatively, we might be more interested in abstract notions, such as temporal transition or spatial expansion, that straddle many specific domains. Both papers discussed in this entry concern the former: ontologies specific to a concrete domain.

#### Need for Learning Ontology: the Research Problem in the Papers

##### Motivation

The particular aspect of domain ontology the authors of the two papers are concerned with is the efficient construction of ontologies: processing documents and systematically extracting relevant terms and concepts used in a given domain. Smith's apt characterization of ontology as a "dictionary" [1] helps in understanding the complexity of the task and motivates the need for research. Aside from the sheer volume, even within the same domain, stakeholders might not have the same needs and interests for their applications and systems and might consequently opt for their own standards. But a dictionary is supposed to serve as a basis of knowledge representation and sharing for all, thereby facilitating communication and information use by agents. In particular, Maedche and Staab are motivated by the need to "overcome this knowledge acquisition bottleneck through learning and discovering conceptual structures from texts" [3]. Gaeta et al. echo this observation as they point out that "in large complex application domains, this task [of constructing commonly agreed-upon ontologies] can be lengthy, costly, and controversial... [and so] finding (semi-)automatic algorithms to extract ontology concepts ... represents an important [research] activity" [4].

##### Scope and Nature of the Problem

Ontology extraction draws on diverse sub-fields of computer science: "[o]ntology learning puts many research activities [such as in computer linguistics, information retrieval, machine learning, databases, and software engineering] -- which focus on different inputs but share a common domain conceptualization -- into one perspective" [5]. In particular, with a summary of relevant literature, Maedche and Staab point out that "[u]ntil recently, ontology learning -- for comprehensive ontology construction -- did not exist" [5]. Thus, what these authors are concerned with is a framework in which work from multiple sub-fields can be put to this new purpose. Gaeta et al. share this interest as they say that "most approaches have 'only' considered one step in the overall ontology engineering process, for example, the acquisition of concepts, the establishment of a concept taxonomy, or the discovering of conceptual relationships, whereas one must consider the overall process when building real-world applications" [4].

### Selective Presentation of the Work by Gaeta et al

This section describes the approach of Gaeta et al., focusing on the contributions they claim over previous works, including the similar work by Maedche and Staab. The latter is briefly presented first to provide a more detailed context for the primary paper.

Editing Notes

• Please refer to their papers for the schematic diagrams of the respective systems, which help clarify the fundamental difference in their approaches; those figures are not included here out of concerns over copyright infringement. (A similar remark applies to the description of the experiment results in the next section.)
• Given that these papers do not claim novel algorithms in Natural Language Processing as primary contributions, NLP jargon is avoided as much as possible in the following exposition, so as to facilitate reading (and to avoid conflating ontology extraction with NLP, which would conflict with the interest and goal stated by the authors themselves). Maedche and Staab [5] provide an extensive list of works in which technical details that might find application in their system are discussed; the interested reader should refer to those sources.

#### Previous Work: the Semi-automatic Model by Maedche and Staab

In their paper [3], Maedche and Staab introduce Text-To-Onto, a concrete implementation of a system that utilizes information in texts to build ontologies. In the following, the basic principle of ontology extraction is first explained, following the authors' exposition. Subsequently, the structure of the system they constructed is briefly described.

##### From Texts to Ontology

Regarding the relation between texts and ontology, Maedche and Staab state that "[t]hrough the understanding of the text [,] data [sic] is associated with conceptual structures and new conceptual structures are learnt from the interacting constraints given through language" [3]. The authors' explanation of this fundamental idea underlying ontology extraction that leverages the structures, meanings, and usage of sentences might be presented in terms of the following major steps:

1. parsing a given text into syntactical components of sentences;
2. relating, through an existing lexicon, pertinent syntactical elements to the appropriate definition/meanings/concepts in the tentative ontology;
3. finding statistical relationships among the syntactical elements; and
4. adjusting the ontology on the basis of the learnt quantified information [3, 5].

There are two things to note about this basic procedure. First, because of the "map[ping]" [3] in step 2, the information about words in sentences is associated with the corresponding definitions/meanings/concepts in a given ontology. Hence, the statistical study in step 3 can provide useful information about the ontology: what concepts should be there and how they should be interrelated, in light of the way language expressing them is structured. The basic idea here seems somewhat similar to isomorphism in abstract algebra (to be sure, this map is not bijective): the construction of a map allows transferring information found in one space to another -- from texts to ontology. Second, different terms may be mapped to the same definition/meaning/concept, as in the case of synonyms denoting the same thing.

The following example of the above connection between texts and ontology is based on that of the authors, who used the original to illustrate their use of a data-mining algorithm [3]. In the domain of agriculture, we might learn that in our ontology there should be correlations between acidity and fruits such as lemon and lime, based on the observation of correlations among the corresponding words, "acidity", "lemon", and "lime", in parsed texts. Also, the statistical information might show an associated correlation between acidity and citrus in the ontology.
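The four steps above can be sketched in code. The following minimal Python sketch uses a toy lexicon in place of the domain database/WordNet lookup and naive word splitting in place of real parsing; all terms and names are illustrative, not taken from the papers.

```python
from collections import Counter
from itertools import combinations

# Toy lexicon mapping surface terms to ontology concepts (a stand-in for
# the domain lexicon / WordNet lookup in step 2); purely illustrative.
# Note that "sour" and "acidity" collapse to the same concept, as synonyms do.
LEXICON = {"lemon": "Lemon", "lime": "Lime", "citrus": "Citrus",
           "acidity": "Acidity", "sour": "Acidity"}

def concept_cooccurrence(sentences, lexicon):
    """Count how often pairs of concepts co-occur in a sentence (step 3)."""
    pair_counts = Counter()
    for sentence in sentences:
        # Step 1 (a crude stand-in for syntactic parsing): word splitting.
        tokens = sentence.lower().replace(",", " ").replace(".", " ").split()
        # Step 2: map recognized terms to concepts via the lexicon.
        concepts = sorted({lexicon[t] for t in tokens if t in lexicon})
        # Step 3: tally each unordered concept pair appearing together.
        for pair in combinations(concepts, 2):
            pair_counts[pair] += 1
    return pair_counts

corpus = [
    "The lemon is prized for its acidity.",
    "A lime adds acidity, like other citrus.",
    "Sour lemon and lime are citrus fruits.",
]
counts = concept_cooccurrence(corpus, LEXICON)
# Step 4 would adjust the ontology using these counts; here we just inspect
# the statistical evidence linking Acidity and Lemon.
print(counts[("Acidity", "Lemon")])  # prints 2
```

A real system would of course use proper parsing and an association-rule or frequency-threshold algorithm in step 4; the point here is only how the term-to-concept map lets sentence-level statistics speak about the ontology.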

##### System Structure of Text-To-Onto

Text-To-Onto by Maedche and Staab uses GATE, a Java framework for natural language processing, and comprises six main components. The following is a brief summary of the system described in their paper [3].

• Text and Processing Management: The system takes natural language texts, document processing methods, and learning algorithms as the input.
• Text Processing Server: The multi-module component parses the text syntactically and semantically, according to the specified processing method, accessing two lexical databases. In particular, use of the database of words of interest in the given domain allows the mapping described above. The output is given as XML files.
• Lexical DBs: The databases of the system are used for general syntactic parsing and for the mapping.
• Learning and Discovering: The system processes the XML files from the preceding stage, according to the algorithm specified in the first stage. The statistical analysis mentioned in the preceding section is carried out in this stage.
• Ontology Engineering Environment: The component provides the editor for the user to manage ontology learning. In particular, the user can determine whether to incorporate the results from the Learning and Discovering stage.

In terms of the system architecture, the important thing to note is that Maedche and Staab build their system on the assumption of user interaction. In fact, they say, "We have to emphasize that we do not consider fully automatic ontology acquisition from text as realistic, so we support the knowledge engineer as much as possible with graphical user interfaces and visualization of discovered conceptual structures" [3]. Their system is by design semi-automatic.

#### Approach of Gaeta et al

Gaeta et al. also implement a multi-module ontology extraction system that uses GATE. However, they explicitly cite a design aimed at eventual automation as a main contribution, and test their system in a case study in an education setting as well as by comparison with manually constructed ontologies [4]. The following is a summary of their description of the ontology extraction system [4]. By the very nature of the complexity of ontology extraction from texts, which encompasses not only natural language parsing but also semantic analysis, their description covers a wide range of topics. Details are chosen for the exposition in this entry so that the summary brings out the main differences in system design between Gaeta et al. on the one hand and Maedche and Staab on the other.

• Text Importing and Term Extraction: the multi-module component parses text input that can be in several formats; the output in all cases is in the XML format. A noticeable feature of their text importing process is that it captures information about the file organization and document structure in order to utilize it later in finding semantic relationships among words that occur in the structured context. Also, although their system uses the general-purpose lexical database WordNet, the authors do not mention an additional database storing domain-specific words of interest. In fact, the process of extracting pertinent words from the imported documents is described as capable of selecting words of interest in the given domain, using heuristics that estimate relevance.
• Preprocessing: the module determines which of the two semantic analysis approaches is to be used by the next stage. The top-down method applies when the system can utilize available organization in the original document; the bottom-up method enables processing even in the other cases.
• Semantic Interpretation/Ontology Construction: this component corresponds to the mapping and statistical analysis described above. The authors consider two situations of construction, namely creation of a new ontology and extension of an initial ontology. In either case, the goal of the stage is to find a hierarchical organization comprising the concepts expressed by the relevant words. In particular, the authors build the conceptual organization around a basic relation among concepts called HP (HasPart). Regardless of the approach, top-down or bottom-up, WordNet is utilized to refine an initial approximation of the ontology by combining synonymous terms and finding relationships among the underlying WordNet synsets. However, in the former approach, the organization of the relevant words in the original structured context of the input directory and documents helps determine HP relations; whereas, in the latter approach, statistical information about corresponding words in texts helps find HP relations.
• Harmonization/Fragmentation (purportedly for optional processing): this component concerns combining multiple ontologies. It apparently does not have a corresponding part in the version of the system of Maedche and Staab [3] cited in the previous works section. Also, the authors themselves acknowledge that user interactions are indispensable for this component, so that it does not lend much support to their main claim about eventual automatic extraction. So, the component is skipped for the purpose of the exposition in this entry.
• Refinement and Validation of Created Ontology: the component searches for potential errors in the constructed ontology, for instance by checking for an expected balance of hierarchical tree branches in depth and an assumed even distribution of linguistic references to ontology parts. However, the authors acknowledge that the user needs to do the actual error correction in an editor before the ontology is exported in OWL.
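As an illustration of how the top-down approach might exploit document organization, the following sketch derives candidate HP (HasPart) pairs from slash-separated structural paths, treating each parent section as "having part" each of its children. Both the paths and the rule itself are hypothetical stand-ins for the authors' actual heuristics, which the paper does not spell out at this level of detail.

```python
def hp_candidates(paths):
    """Given slash-separated structural paths (directory/heading hierarchy),
    emit candidate (whole, part) pairs for the HP relation."""
    pairs = set()
    for path in paths:
        segments = path.split("/")
        # Each adjacent (parent, child) pair in the path suggests HasPart.
        for parent, child in zip(segments, segments[1:]):
            pairs.add((parent, child))
    return pairs

# Illustrative structure of an input document collection.
structure = [
    "Enterprise Management/Accounting/Balance Sheet",
    "Enterprise Management/Accounting/Income Statement",
    "Enterprise Management/Marketing",
]
print(sorted(hp_candidates(structure)))
```

In the bottom-up case, by contrast, no such structure is available, and HP candidates would instead be proposed from co-occurrence statistics of the corresponding words in the texts.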

The most noticeable difference in system design is that their implementation of ontology extraction is not centered around user interaction. For one thing, elimination of a purpose-built lexical database for a given domain might help reduce maintenance costs. Also, even though it is not completely clear from their description what degree of human monitoring is necessary to actually operate the system in a more or less seamless fashion, at least the authors do not include a central editor as a major component of their system. One might question, however, how this design can, in a more or less automatic fashion, use accumulated data to revise, rather than extend, an ontology under construction. It seems possible that, in light of additional data, certain existing relationships among concepts would need to be adjusted or corrected. Presumably, the process results in errors, which are to be corrected by the user in the last stage.

#### Verification

This section briefly presents the empirical evaluation of the approach by Gaeta et al in the paper. The summary is followed by a critical discussion of their work in light of the evidence.

##### Case Study

As part of the verification of their work, the authors conducted a case study in computer-assisted learning. The following is the gist of their report in the paper [4]. Essentially, the role of ontology extraction in this setting is to construct a learning plan based on hierarchical or ordering relationships among concepts pertinent to learning the subject. To this end, the authors use several relations, such as HasPart and HasResource for hierarchy and IsRequiredBy for ordering. For instance, a top-level concept, which is to be mastered by the student, might require prerequisite understanding of lower-level concepts related by HasPart. Those preliminary concepts, in turn, might have to be ordered by IsRequiredBy for efficacious learning. Also, available learning resources need to be related to those lower-level concepts by HasResource.
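To illustrate how such ordering relations could yield a learning plan, here is a small sketch (an assumption of this entry, not taken from the paper) that treats IsRequiredBy as prerequisite edges and orders concepts with a topological sort; the concept names are invented.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Illustrative relation instances: each entry maps a concept to the set of
# concepts that must be learned first, i.e. the concepts pointing at it
# via IsRequiredBy. These names are hypothetical.
prerequisites = {
    "Budgeting": {"Accounting Basics", "Cost Concepts"},
    "Cost Concepts": {"Accounting Basics"},
    "Accounting Basics": set(),
}

# A learning plan is any study order consistent with the prerequisites;
# static_order() yields prerequisites before the concepts that need them.
plan = list(TopologicalSorter(prerequisites).static_order())
print(plan)  # ['Accounting Basics', 'Cost Concepts', 'Budgeting']
```

HasPart and HasResource would then attach sub-concepts and learning materials to each node of the resulting sequence.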

• Subjects: 28 students of enterprise management from 7 different institutes, who participated voluntarily.
• Procedure: in computer-assisted learning sessions (the authors say a "phase", with the number of sessions being unclear in this paper), only the experiment group (the number of subjects in this group and how subjects were assigned to the two groups are not stated in this paper) used the additional features provided by the learning plan from ontology extraction, as described above. The pre- and post-experiment performance of the two groups was measured by tests on a scale of 1-10 and categorized into three levels of mastery.
• Results (the percentages read from the graph they provide in the paper)
• For the control group,
• pre-experiment: ${\displaystyle \sim 40\%}$ in the low level, ${\displaystyle \sim 40\%}$ in the medium level, and the rest in the high level
• post-experiment: ${\displaystyle \sim 20\%}$ in the low level, ${\displaystyle \sim 70\%}$ in the medium level, and the rest in the high level
• For the experiment group,
• pre-experiment: ${\displaystyle \sim 60\%}$ in the low level and the rest in the medium level
• post-experiment: ${\displaystyle \sim 60\%}$ in the medium level and the rest in the high level

##### Comparison with Manually-constructed Ontologies

For verification, the authors also quantitatively compare the ontologies produced by their system with conventionally hand-constructed ontologies. The following summarizes their description [4]. The context in which they made the comparison is related to education, as in their case study; in particular, they constructed similar ontologies in several different subject areas of enterprise management. The metrics they used are defined as follows. With ${\displaystyle Ref_{O}}$ the set of terms in the manually constructed ontology, and ${\displaystyle Res_{O}}$ the corresponding set for the ontology constructed by their system, ${\displaystyle Precision\equiv {\frac {|Ref_{O}\cap Res_{O}|}{|Res_{O}|}}}$ and ${\displaystyle Recall\equiv {\frac {|Ref_{O}\cap Res_{O}|}{|Ref_{O}|}}}$. That is, these are ratios of the cardinalities of the relevant sets; in particular, the elements in the intersection are included in both sets. The import of these quantities is "a measure of exactness or fidelity" and "a measure of completeness" [4], respectively. They further define another metric, the weighted harmonic mean ${\displaystyle F_{\beta }\equiv {\frac {(1+\beta ^{2})(Precision*Recall)}{\beta ^{2}Precision+Recall}}}$. The table below combines the two tables for the data and results found in the authors' report.

| Domain | Number of documents | Precision | Recall |
|---|---|---|---|
| Art | 46 | 0.78 | 0.48 |
| Database | 102 | 0.82 | 0.52 |
| Human Resources | 68 | 0.65 | 0.44 |
| Information Systems | 126 | 0.83 | 0.49 |
| Problem Solving | 26 | 0.96 | 0.59 |
| Tourism | 85 | 0.77 | 0.62 |
| User Modeling | 110 | 0.89 | 0.53 |
(From TABLE 1 and TABLE 2 of [4])

They report ${\displaystyle {\overline {F_{0.5}}}=0.73}$.
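These metrics are easy to compute directly. The sketch below applies the standard definitions (with the usual weighted harmonic mean ${\displaystyle F_{\beta }}$) first to toy term sets, then to the precision/recall pairs tabulated above; the toy sets are illustrative, while the table values come from the paper.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Toy term sets (illustrative, not the authors' data).
ref = {"crop", "citrus", "lemon", "lime", "acidity"}   # manual ontology terms
res = {"crop", "citrus", "lemon", "soil"}              # extracted ontology terms
precision = len(ref & res) / len(res)   # 3/4 = 0.75
recall = len(ref & res) / len(ref)      # 3/5 = 0.60

# Precision/recall pairs from TABLE 1 and TABLE 2 of [4], as tabulated above.
table = [(0.78, 0.48), (0.82, 0.52), (0.65, 0.44), (0.83, 0.49),
         (0.96, 0.59), (0.77, 0.62), (0.89, 0.53)]
mean_f05 = sum(f_beta(p, r, 0.5) for p, r in table) / len(table)
print(round(mean_f05, 2))  # prints 0.73, matching the reported mean
```

Note that the mean ${\displaystyle F_{0.5}}$ computed from the table reproduces the authors' reported value of 0.73, which also serves as a consistency check on the transcribed figures.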

#### Brief Critical Discussion

The central claim of the paper by Gaeta et al. is a viable system design that leads toward automation, which they evaluated with the experiments. Overall, however, they fall short of offering a convincing argument and evidence.

To be sure, certain merits of the paper need to be acknowledged. For one thing, they designed their system with the explicit eventual goal of automation, which departs from their predecessors' fundamental approach. Some features they report, such as estimating the relevance of terms from the parsed texts without referring to a purpose-built lexical database and finding potential errors in constructed ontologies automatically, seem to reflect this design principle. Moreover, they showed that their implemented system at least yields ontologies and reported an application in education that might lead to eventual practical use. Also, the two experiments may be said to be designed to augment each other: simple ratios of overlapping terms may not necessarily show that those terms are related in a meaningful manner; and the case study alone may not allow quantitative assessment of the system performance compared with human-constructed ontologies.

However, there are several shortcomings in their work. Most importantly, the experiments they designed and conducted fail to provide strong empirical support in practice.

Regarding the case study, it should be acknowledged that even though the sample size is small, the authors at least included experiment subjects from several different institutes. However, their report of the case study in this paper is terse and seems to lack accuracy. In particular, it is not clear why the control and experiment groups appear so different with respect to their initial mastery of skills. It might be due to the small sample size and random assignment to the respective groups, but the authors do not state the assignment method they used.

As for the other experiment, it also raises questions. For one thing, the low recall rates indicate that the ontologies constructed by their system failed to include about half of the terms that the engineers manually constructing ontologies included, even though the precision data show that well over half of the terms their system included also appear in the manually constructed counterparts. Also, the authors do not mention the size of each document. It is not clear why, for the domain of Problem Solving, the application attained such a high precision rate and a relatively good recall score with the smallest number of documents. Furthermore, a more general issue is that the preparation of the ontologies used in the comparison presumably involved the aid of some software at least in some stages, which could influence similarities in the selection of terms, perhaps due to the algorithms used. Still another questionable point is that the data they provide concern only intra-study comparison. Although similar data might not be readily available, comparison with the performance of other ontology extraction systems would help establish their claim.

Further questions concerning both experiments might be asked. To what extent did they use harmonization and other user-interaction-intensive features to generate the results they reported in the experiments? A related point is that since a produced ontology presumably needs incremental adjustment and expansion, information as to the exact procedure and time incurred would have to be provided. Also, they did not design the experiments so that the effect of each feature or algorithm they used could be individually controlled and assessed, which might have provided an evaluation not only of the important system components but also of the system architecture per se.

The authors are not very consistent in the conclusion, failing to reiterate the contribution they initially claimed about automation and to discuss the extent to which they actually achieved the goal. In the end, it is not very clear what their contributions actually are.

## Annotated Bibliography

[1] Smith B. Ontology. Blackwell Guide to the Philosophy of Computing and Information. Luciano Floridi (ed). pp. 155-166. Oxford UP, Blackwell, 2003. https://philpapers.org/archive/SMIO-11.pdf

[2] Uschold M., Gruninger M. Ontologies and Semantics for Seamless Connectivity. SIGMOD Record, Vol. 33, No. 4, pp. 58-64, December 2004. https://dl.acm.org/citation.cfm?id=1041420

[3] Maedche A., Staab S. The Text-To-Onto Ontology Learning Environment. Software Demonstration at ICCS-2000, the Eighth International Conference on Conceptual Structures, 2000. https://s3.amazonaws.com/academia.edu.documents/30556923/10.1.1.67.7639.pdf

[4] Gaeta M., Orciuoli F., Paolozzi S., Salerno S. Ontology Extraction for Knowledge Reuse: The e-Learning Perspective. IEEE Transactions on Systems, Man, and Cybernetics -- Part A: Systems and Humans, Vol 41, No. 4, July 2011. https://ieeexplore.ieee.org/abstract/document/5765718

[5] Maedche A., Staab S. Ontology Learning for the Semantic Web. IEEE Intelligent Systems, Vol. 16, Issue 2, Mar-Apr 2001. https://ieeexplore.ieee.org/abstract/document/920602