Automation of hypothesis generation and testing in science
Artificial Intelligence tools such as ontologies can be used in science to organize knowledge and to automate cycles of hypothesis testing and experimentation.
Principal Author: Michela Minerva
Paper 1: The automation of science 
Paper 2: Towards automated hypothesis testing in neuroscience 
An important part of scientific experimentation is keeping track of all data and processes, which makes results easier to share and experiments reproducible. The introduction of technology into scientific experiments generates large volumes of data and metadata at every stage of the hypothesis-testing cycle. This growing body of data is heterogeneous and often chaotic, which prevents it from being fully exploited. Capturing these flows of data and hypotheses in a structured way is therefore fundamental: it allows proper recording, further hypothesis testing and data analysis.
This page first describes the robot scientist Adam, a system in which hardware and software are integrated to perform laboratory experiments. Its software uses Artificial Intelligence to generate and execute hypothesis-testing cycles. The page then describes how the idea of using ontologies to organize knowledge and support reasoning is exploited in Neuro-DISK, a software system that tests hypotheses in neuroscience in an automated way.
This page builds on the concept of ontology (see also the foundational page on ontologies, Course:CPSC522/Ontology). These tools are applied to the hypothetico-deductive model of scientific experiments. The paper on the robot scientist Adam makes available additional information and resources regarding software and experimental results.
Introduction and fundamentals
The hypothetico-deductive method is the most important method used in the sciences to carry out experiments and generate new discoveries and knowledge. Alongside this method, another fundamental part of the scientific process is recording all data and processes throughout an experiment, so that it can be reproduced.
Technology and automation are increasingly present in scientific experiments. They allow larger amounts of data to be generated, which leads to new discoveries and knowledge and is fundamental for reproducibility. However, the power of this data is often not fully exploited: large bodies of scientific knowledge are generally heterogeneous, chaotic and expressed in natural language. Artificial Intelligence tools such as ontologies can provide the means to structure this data and give it semantic clarity. This allows not only the effective exchange of scientific knowledge and the reproducibility of experiments, but also the active use of knowledge to generate new discoveries or start new experiments. These are the same motivations that, outside the specific field of scientific experiments, drove the development of ontologies and the Semantic Web.
An ontology is a formal and structured representation of concepts, data and relationships in a specific domain of interest. It not only describes data, entities and concepts themselves, but also provides a formal naming system and a definition of their properties, relations and categories. Representing information with an ontology structures data in a well-defined way, attaches metadata and further information to entities and their relationships, and adds semantic meaning to data that would otherwise appear flat and unorganized. Furthermore, this structure enables automated systems that actively process the data, for example through inductive reasoning, problem solving or classification.
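As a rough illustration of how a structured representation enables automated reasoning, the following minimal sketch stores a toy ontology as subject-predicate-object triples and infers the classes of an entity by following subclass links. The entity names, the predicates (`subclass_of`, `instance_of`, `catalyses`) and the helper functions are assumptions made for this example; they are not taken from either paper or from any ontology standard.

```python
from collections import defaultdict

# Hypothetical mini-ontology: (subject, predicate, object) triples.
TRIPLES = {
    ("Enzyme", "subclass_of", "Protein"),
    ("Protein", "subclass_of", "Molecule"),
    ("hexokinase", "instance_of", "Enzyme"),
    ("hexokinase", "catalyses", "glucose_phosphorylation"),
}


def superclasses(cls, triples):
    """Return every class reachable from `cls` through subclass_of links."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == "subclass_of":
            parents[s].add(o)
    found, stack = set(), [cls]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def classes_of(entity, triples):
    """Infer all classes of an entity: asserted classes plus their superclasses."""
    direct = {o for s, p, o in triples if s == entity and p == "instance_of"}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return inferred


print(sorted(classes_of("hexokinase", TRIPLES)))
# prints ['Enzyme', 'Molecule', 'Protein']
```

Only the membership in `Enzyme` is asserted explicitly; the other two classes are derived from the hierarchy, which is the kind of simple inference that flat, unstructured data does not support.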
The most prominent example of the use of ontologies is the Semantic Web, a project aimed at transforming the World Wide Web into a more structured system in which semantic meaning is added to the available resources. In the Semantic Web, documents such as HTML pages and multimedia data are associated with information and metadata. This adds semantic context to documents that, in the standard World Wide Web, are structured in a heterogeneous and unorganized way. Furthermore, this format is suitable for specific types of querying and automatic reasoning.
The papers described on this wiki page use ontologies in a way similar to the Semantic Web, although each applies them to a specific field. They use ontologies to represent and structure a large amount of data and metadata, which can come from heterogeneous sources. Thanks to this organization, they actively work on the data to perform reasoning and specific forms of querying. The robot scientist Adam integrates the data and software structure into a hardware-software system that performs scientific experiments in a laboratory; its field of application is biology. The Neuro-DISK project, on the other hand, aims at organizing heterogeneous knowledge from several sources in a framework that continuously uses data to test hypotheses, even after the related experiment has concluded. Unlike Adam, the system only listens and reacts to data that is generated and collected externally; its field of application is neuroscience.
Robot scientist Adam
The robot scientist Adam is a system that integrates hardware and software to perform cycles of scientific experiments in an automated way; its hardware is designed to work in a biology laboratory. Adam brings automation both to the hypothetico-deductive process adopted during an experiment and to the recording of the data generated throughout the experiments. Its software relies on Artificial Intelligence tools to generate and carry out hypothesis-testing cycles, from the generation of new hypotheses to the execution of experiments and the recording of results.
In a typical working cycle, a robot scientist like Adam automatically generates hypotheses to explain the available data and observations, designs experiments to test those hypotheses, physically runs the experiments, and interprets and records their results. Data and metadata are produced throughout this process. Adam captures, structures and stores this data, making it possible to track all the results and processes related to its experiments.
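The working cycle described above can be sketched as a simple loop. This is only a toy model under stated assumptions: the class `ToyRobotScientist`, the gene names, the rule that deleting the correct gene impairs growth, and the `simulated_lab` function standing in for the laboratory hardware are all hypothetical and do not reproduce Adam's actual software.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Record:
    """One structured log entry: the hypothesis tested and what was observed."""
    hypothesis: str
    observation: bool
    consistent: bool


@dataclass
class ToyRobotScientist:
    candidates: List[str]                   # hypothetical candidate genes
    run_experiment: Callable[[str], bool]   # stands in for the lab hardware
    log: List[Record] = field(default_factory=list)

    def cycle(self) -> Optional[str]:
        """Generate, test, interpret and record until a hypothesis is confirmed."""
        for gene in self.candidates:               # 1. generate a hypothesis
            predicted = True                       # 2. deduce: deletion impairs growth (assumed rule)
            observed = self.run_experiment(gene)   # 3. run the experiment
            consistent = observed == predicted     # 4. interpret the result
            self.log.append(Record(gene, observed, consistent))  # 5. record it
            if consistent:
                return gene
        return None


def simulated_lab(gene: str) -> bool:
    """Assumed ground truth: growth is impaired only when geneB is deleted."""
    return gene == "geneB"


robot = ToyRobotScientist(["geneA", "geneB", "geneC"], simulated_lab)
confirmed = robot.cycle()
print(confirmed, len(robot.log))
```

Every iteration appends a `Record`, including the rejected hypothesis, so the log keeps a trace of the whole investigation rather than only its final answer.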
To test the robot scientist, Adam was applied to identifying the genes that encode orphan enzymes in yeast (Saccharomyces cerevisiae) by performing growth experiments. The software components designed for this test are:
- A logical model which describes specific knowledge about the metabolism of the yeast; this is expressed in Prolog
- A database with a more general knowledge about genes and proteins involved in metabolism
- Software that, drawing on databases and bioinformatics software, generates hypotheses about genes encoding orphan enzymes
- Software to deduce and design specific experiments that test the consequences of given hypotheses
- Software that performs the actual tests, controlling the physical parts of the robot and recording the results of the experiments in a database
- Software to analyze data and metadata generated in the previous steps
- Software to relate the data to the initial hypotheses; it uses machine learning techniques (decision trees and random forests) to decide whether the results are consistent with a hypothesis
- EXPO, an ontology of scientific experiments, for formalizing the experimental process. A customized version of EXPO, called LABORS and expressed in OWL-DL, was developed for Adam. The data is stored in a MySQL database. The figure on the right shows an example of a formalization built with these tools, for the automated study of YER152C function.
Additional information on the software can be found in the supporting materials of the paper. It is worth noting that this software and hardware architecture allows the robot to work almost without human intervention: no intellectual human input is required to perform the hypothesis-testing cycles.
Adam worked on 13 orphan enzymes and, having formulated 20 hypotheses about the genes encoding them, was able to confirm 12 of those hypotheses. At the end of the experiments, the data formalized with LABORS included 10,000 different research units; it had a hierarchical, tree-like structure ten levels deep and included connections between the experimental observations and the metadata.
Using a robot scientist like Adam allows you to formalize and record all parts of a scientific investigation. Making data and metadata structured allows scientific research to be reproducible, reusable and easier to explain. Moreover, knowledge generated from an investigation can be used in other experiments to answer different questions or generate further knowledge.
It can be argued that the knowledge generated by Adam is not new knowledge, since it was already contained in the available data. The idea that computers cannot create new knowledge originates from Lady Lovelace and was cited by Turing. However, the power of Adam lies in its ability to exploit data in an automated way, allowing us to discover scientific knowledge that was already present but not trivial to extract.
Neuro-DISK and automated hypothesis testing in neuroscience
The main contribution of the second paper is Neuro-DISK, a system consisting of a framework and an ontology that stores data from neuroscience research, continuously processes this data, and tests given hypotheses as soon as new data becomes available.
Hypothesis testing and data analysis are two fundamental parts of scientific research. However, the data analysis process in a research project is often hard to interpret and reproduce; moreover, hypotheses are typically tested once with the data available at the time and then archived. Data from experiments and research is generated continuously and can be relevant for re-evaluating old hypotheses, since different datasets can produce contradictory results for the same hypothesis. For this reason, it is important to have a framework that performs continuous and automatic hypothesis re-evaluation based on flows of new data.
Neuro-DISK was developed to work within the ENIGMA consortium, an international collaboration for scientific research and knowledge in neuroscience that combines several datasets and collects the work of various groups. In this context, a system able to organize data and resources and to capture the hypotheses being investigated is important for the effectiveness of ENIGMA. Neuro-DISK allows data to be identified and retrieved in such a heterogeneous environment, and it can automatically find data relevant to a hypothesis or update a hypothesis when new data is available. Neuro-DISK includes an ontology and a framework to organize data, tools and groups within the organization, together with an automated discovery framework that continuously tests hypotheses by executing scientific workflows. The data available to Neuro-DISK through ENIGMA is constantly growing and comes from many different sources, which makes it heterogeneous; data must be collected from different databases within the consortium in order to support a full analysis and hypothesis generation.
ODS (Organic Data Science framework) is used to organize and share data within ENIGMA. ODS uses W3C standards such as RDF and SPARQL to represent and structure content, and it is built on Semantic MediaWiki. The central element of this system is the wiki page: each wiki page represents a specific resource and contains information regarding the properties of the page’s class. Each resource can be associated with metadata. The result is the knowledge base ENIGMA-ODS, which is structured according to the ENIGMA Ontology; this ontology differs from standard ontologies because it includes specific constructs for representing ENIGMA’s datasets, working groups and tools, together with their relationships. The system supports SPARQL queries for data retrieval and modification.
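To give an intuition for how a SPARQL-style query retrieves resources from such a knowledge base, the sketch below evaluates a conjunction of triple patterns over an in-memory list of triples, written in plain Python rather than with a real RDF library. The resource names, the properties (`type`, `collectedFrom`, `hasMeanAge`) and the functions `match` and `query` are invented for illustration and are not part of the ENIGMA Ontology or of ODS.

```python
# Hypothetical knowledge base: each wiki page is a resource described by triples.
KB = [
    ("Cohort_A", "type", "Cohort"),
    ("Cohort_A", "hasMeanAge", "45"),
    ("Cohort_B", "type", "Cohort"),
    ("Dataset_1", "type", "Dataset"),
    ("Dataset_1", "collectedFrom", "Cohort_A"),
    ("Dataset_2", "type", "Dataset"),
    ("Dataset_2", "collectedFrom", "Cohort_B"),
]


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; return None on failure."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):              # variable: bind it or check consistency
            if new.setdefault(term, value) != value:
                return None
        elif term != value:                   # constant: must match exactly
            return None
    return new


def query(patterns, triples):
    """Evaluate a conjunction of triple patterns, returning all variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                result = match(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return bindings


# Roughly analogous to the SPARQL query (hypothetical vocabulary):
#   SELECT ?d ?c ?age WHERE { ?d a :Dataset ; :collectedFrom ?c . ?c :hasMeanAge ?age }
results = query(
    [("?d", "type", "Dataset"), ("?d", "collectedFrom", "?c"), ("?c", "hasMeanAge", "?age")],
    KB,
)
print(results)
```

`Dataset_2` is excluded because its cohort has no recorded mean age, which shows how metadata on each resource determines what a query can retrieve.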
The DISK framework performs automatic analysis of dynamic scientific data and is used to test and revise hypotheses. It operates on data based on its description in ODS metadata. As new data becomes available, DISK re-evaluates the hypotheses for which the new data may be significant. It tracks the revision of hypotheses and can start new analyses when new flows of data become available. This system was extended to the field of neuroscience research, resulting in the Neuro-DISK framework, which accesses data within ENIGMA by integrating ENIGMA-ODS.
The DISK framework relies on a library of Lines of Inquiry (LOI) to manage hypotheses. Each LOI is associated with:
- a hypothesis pattern
- a relevant query data pattern
- a set of workflows to process data
- meta-workflows to combine their results and generate new hypotheses or confidence values.
In Neuro-DISK's workflow for hypothesis evaluation, a user defines the hypothesis of interest. If it matches the pattern of a LOI, the system retrieves data according to the relevant query data pattern and passes it to the workflows for execution. Then, meta-workflows use the results of workflows to evaluate the initial hypothesis.
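The evaluation flow just described can be sketched with a small data structure. This is a hedged toy model, not DISK's implementation: encoding the hypothesis pattern as a regular expression, the in-memory `STORE`, the variable names `varA` and `varB`, and the `fraction_concordant` workflow are all assumptions chosen to keep the example self-contained.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Pairs = List[Tuple[float, float]]


@dataclass
class LineOfInquiry:
    hypothesis_pattern: str                          # regex with named groups (assumed encoding)
    data_query: Callable[[Dict[str, str]], Pairs]    # retrieves data for the matched variables
    workflows: List[Callable[[Pairs], float]]        # each workflow analyses the data
    meta_workflow: Callable[[List[float]], float]    # combines workflow results

    def evaluate(self, hypothesis: str) -> Optional[float]:
        """Return a confidence value, or None if the hypothesis does not match."""
        m = re.fullmatch(self.hypothesis_pattern, hypothesis)
        if m is None:
            return None
        data = self.data_query(m.groupdict())
        results = [workflow(data) for workflow in self.workflows]
        return self.meta_workflow(results)


# Hypothetical data store that grows as new records arrive.
STORE = [{"varA": 1.0, "varB": 1.0}, {"varA": 1.0, "varB": -1.0}]


def query_store(bindings: Dict[str, str]) -> Pairs:
    x, y = bindings["x"], bindings["y"]
    return [(row[x], row[y]) for row in STORE if x in row and y in row]


def fraction_concordant(pairs: Pairs) -> float:
    """Toy workflow: share of records in which both values have the same sign."""
    if not pairs:
        return 0.0
    return sum((a > 0) == (b > 0) for a, b in pairs) / len(pairs)


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


loi = LineOfInquiry(
    hypothesis_pattern=r"(?P<x>\w+) is associated with (?P<y>\w+)",
    data_query=query_store,
    workflows=[fraction_concordant],
    meta_workflow=mean,
)

first = loi.evaluate("varA is associated with varB")
STORE.extend([{"varA": -1.0, "varB": -1.0}, {"varA": 2.0, "varB": 3.0}])  # new data arrives
second = loi.evaluate("varA is associated with varB")
print(first, second)
# prints 0.5 0.75
```

Calling `evaluate` again after the store grows yields a different confidence value for the same hypothesis, which mirrors the idea of re-evaluating hypotheses whenever new data becomes available.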
To test the framework, the following hypothesis was evaluated: “is the effect size of APOE4 genotype on hippocampal volume related to the age of the cohort used for the experiment?”. The hypothesis matches the hypothesis pattern of a LOI. DISK triggers a SPARQL query on the ENIGMA-ODS knowledge base according to the LOI’s data query pattern; the result is the set of identifiers (URLs) of the entities that satisfy the query. These URLs become the input of the workflows associated with the LOI, which perform the computations needed to test the hypothesis. The experiment confirmed the initial hypothesis: it found a negative association between age and the effect of the APOE4 gene on hippocampal volume.
Discussion and future works
Both papers show the successful use of Artificial Intelligence tools, in particular ontologies, applied to hypothesis-based processes in science. The robot scientist Adam uses ontologies to store and structure the results of biology experiments within automated cycles of hypothesis generation and testing. Neuro-DISK, on the other hand, relies on them to perform automatic hypothesis testing in neuroscience, in particular within the ENIGMA consortium, and allows previous hypotheses to be re-tested when new data becomes available.
These works show the power of ontologies, both for representing information and for actively processing data. Ontologies can be exploited to automate intellectual tasks that are typical of humans, such as generating hypotheses and testing them through experiments; in the case of Adam, this led to the discovery of new scientific knowledge. They can also be used to keep hypotheses alive, allowing them to be re-evaluated and updated when new data becomes available, as shown by Neuro-DISK. The use of ontologies and their related tools can go beyond applications in the Semantic Web: they can be applied to specific fields of science, helping humans to organize, process and even generate scientific knowledge.
- King, Ross (2009). "The automation of science". Science. 324: 85–89.
- Garijo, Daniel (2019). "Towards Automated Hypothesis Testing in Neuroscience". Heterogeneous Data Management, Polystores, and Analytics for Healthcare: 149–257.
- "w3c ontology".
- "CPSC 522 ontology".
- "w3c semantic web".
- Soldatova, L.N. (2006). "An ontology of scientific experiments". J R Soc Interface. 3: 795.
- Soldatova, L.N. (2006). "An ontology for a Robot Scientist". Bioinformatics. 22: e464.
- Horrocks, Ian (2003). "From SHIQ and RDF to OWL: The Making of a Web Ontology Language" (PDF). Web semantics. 1: 7.
- King, Ross (2009). "The automation of science supporting materials".
- Turing, Alan (1950). "Computing Machinery and Intelligence". Mind. 59 (236): 433–460.
- Thompson, P.M. (2014). "The ENIGMA consortium: large-scale collaborative analyses of neuroimaging and genetic data". Brain Imaging Behav. 8: 153–182.
- Thompson, Paul et al. (2019). ENIGMA and Global Neuroscience: A Decade of Large-Scale Studies of the Brain in Health and Disease across more than 40 Countries. https://www.researchgate.net/publication/334232733_ENIGMA_and_Global_Neuroscience_A_Decade_of_Large-Scale_Studies_of_the_Brain_in_Health_and_Disease_across_more_than_40_Countries
- Gil, Y. (2012). "Organic data publishing: a novel approach to scientific data sharing". Second International Workshop on Linked Science: Tackling Big Data (LISC).
- "Organic Data Science framework organization".
- Jang, M. (2017). "Towards Automatic Generation of Portions of scientific Papers for Large Multi-Institutional Collaborations Based on Semantic Metadata". CEUR Workshop Proceedings. 1931: 63–70.
- "Enigma ontology documentation".
- Gil, Y. (2016). "Automated Hypothesis Testing with Large Scientific Data Repositories" (PDF). Proceedings of the Fourth Annual Conference on Advances in Cognitive Systems.
- Gil, Y. (2017). "Towards Continuous Scientific Data Analysis and Hypothesis Evolution" (PDF). Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
Adam is one of the first successful robot scientists. Other robot scientists have since been developed on the basis of Adam and applied to scientific research in different fields. One example is Eve, a robot scientist that brings automation to the drug development process.