Cranfield Experiments

The Cranfield experiments were a series of empirical experiments designed to (a) determine the relative efficacy of existing information search systems, and (b) develop a methodology for designing and evaluating new search systems (Cleverdon, 1967; Voorhees, 2001). This methodology consisted of two measures of efficacy that have become central to the discipline of information retrieval: recall (the proportion of relevant documents that are retrieved) and precision (the proportion of retrieved documents that are relevant to the original search query) (Cleverdon, 1967). The Cranfield Experiments, which were implemented over two main stages now known as Cranfield I and Cranfield II, were overseen by Cyril W. Cleverdon, Librarian at the College of Aeronautics in Cranfield, UK (Harman, 2019).
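In modern terms, both measures reduce to simple set arithmetic over the set of documents a system retrieves and the set judged relevant. A minimal Python sketch, using hypothetical document IDs, illustrates the two definitions for a single query:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: set of document IDs returned by the search system
    relevant:  set of document IDs judged relevant to the query
    """
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 of them among the 6 relevant ones.
p, r = precision_recall({"d1", "d2", "d3", "d7"},
                        {"d1", "d2", "d3", "d4", "d5", "d6"})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```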

History

It is widely accepted that the formal empirical evaluation of information retrieval systems began with the Cranfield Experiments, which were carried out between 1958 and 1966 (Robertson, 2008). Up until this point, information retrieval methodology had been modelled primarily upon library classification systems such as the card catalogue, which depended upon the expertise of librarians to carry out the actual retrieval (Robertson, 2008). The main advancement of the Cranfield Experiments was in shifting the focus toward the information needs of end users and the efficacy of retrieval systems in meeting these needs (Voorhees, 2001).

Cranfield I

The goal of the first round of Cranfield Experiments (1958-1962) was to compare the relative utility and efficacy of four distinct indexing methods: the Universal Decimal Classification System, an alphabetical subject catalogue, a faceted classification scheme, and the Uniterm System of Co-ordinate Indexing (Cleverdon, 1967; Harman, 2019; Robertson, 2008; Voorhees, 2001).

This evaluation was carried out in Cranfield I using a ‘source document’ method, in which a reference document was first identified, and the author of that document was then asked to develop a search query that could be satisfactorily resolved by that document (Harman, 2019; Robertson, 2008; Voorhees, 2001). This combination of a document set, queries expressing information needs, and relevance judgements constitutes the test collection for the experiment. Each of the four indexing methods was then used in turn to search for the target document, and the results were compared. Little difference was found between the methods, but the failure rate for all of them was high, at around 35% (Harman, 2019).
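In outline, the ‘source document’ test reduces to a per-query success check: did the indexing method retrieve the document its query was written from? The Python sketch below illustrates the idea; the query texts, document IDs, and the toy_search function are all hypothetical stand-ins for one indexing method under test:

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str        # the information need, written by the source document's author
    source_doc: str  # ID of the document the query is expected to retrieve

# Hypothetical test collection: queries paired with their source documents.
queries = [
    Query("effect of aspect ratio on wing lift", "doc_017"),
    Query("boundary layer control at high speed", "doc_142"),
]

def success_rate(search_fn, queries):
    """Fraction of queries for which a method retrieves its own source document.

    search_fn: a retrieval function (query text -> list of document IDs)
    standing in for one of the indexing methods under test.
    """
    hits = sum(1 for q in queries if q.source_doc in search_fn(q.text))
    return hits / len(queries)

# A toy 'search' for demonstration: return documents sharing any query term.
corpus = {
    "doc_017": "aspect ratio and wing lift",
    "doc_142": "boundary layer control at high speed",
}

def toy_search(query_text):
    terms = set(query_text.lower().split())
    return [doc_id for doc_id, text in corpus.items()
            if terms & set(text.lower().split())]

print(success_rate(toy_search, queries))  # 1.0 on this toy collection
```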

Issues

In the interest of experimental simplification, the original Cranfield experiments made the following three assumptions about the relationships between users, information, and retrieval:

1. that the relevance of a result is a binary property, determined by similarity to the user’s expression of their information need, i.e. the user query;

2. that the relevance of results to an entire user population is reducible to a single (expert) user’s judgement of relevance;

3. that all possible results of an information search are known ahead of time (i.e. that the list of potential results is exhaustive) (Voorhees, 2001).

Collectively, these shortcomings amount to a failure to distinguish between mechanisms promoting precision and those promoting recall (Cleverdon, 1967).

Cranfield II

The second round of Cranfield Experiments (1962-1966) implemented a modified approach to the ‘source document’ method, in which all of the documents determined to resolve the target query were located, and the efficacy of different search systems in producing these documents was then evaluated (Cleverdon, 1967; Harman, 2019; Robertson, 2008; Voorhees, 2001). This second experiment, however, evaluated search systems at a more fundamental level. Only two indexing methods were tested, automatic and manual, and the manual system was further subdivided by indexing language: single terms, simple concepts, controlled terms, and terms found in abstracts and titles were each tested in turn.

The documents identified through these tests were reviewed by the same expert user who had developed the search query, who evaluated their relevance on a scale of 1-5, with clear definitions provided for each rating (Harman, 2019). Crucially, this analysis recorded recall and precision as separate results, which allowed the relationship between the two measures to be experimentally evaluated for the first time. The Cranfield II tests determined that single terms produced the most relevant results, and confirmed that an inverse relationship exists between recall and precision: retrieving a larger proportion of the relevant documents tends to lower the proportion of retrieved documents that match the query submitted (Cleverdon, 1967; Jones, 2004). This insight, in turn, helps to ground the contemporary construction of inverted indexes, which pair a list of terms with, for each term, a list of the resources in which that term can be found.
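A minimal sketch of such an index, assuming a naive whitespace tokenizer and a hypothetical miniature collection, might look as follows:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it.

    docs: dict of document ID -> text. Tokenization here is a naive
    lowercase split; real systems add stemming, stopword removal, etc.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical miniature collection.
docs = {
    "d1": "supersonic wing design",
    "d2": "wing loading in supersonic flight",
    "d3": "boundary layer control",
}
index = build_inverted_index(docs)
print(index["wing"])  # ['d1', 'd2']
```

Production systems enrich the per-term lists with frequencies and positions, but the underlying structure, a dictionary of terms paired with per-term lists of resources, is the one described above.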

Legacy

The Cranfield Experiments succeeded in developing an approach to the evaluation of information retrieval systems that has remained central to much of the empirical work in the area to this day, continuing to serve as the basis of such contemporary evaluation projects as the Text Retrieval Conference (TREC) and the Cross-Language Evaluation Forum (CLEF) (Voorhees, 2001). Moreover, the Experiments established the two measures of search efficacy, precision and recall, that both define successful information retrieval and ground the design and evaluation of new retrieval systems (Jones, 2004).

References

Cleverdon, C. 1967. The Cranfield Tests on Index Language Devices. Association for Information Management (ASLIB) Proceedings 19(6): 172-194.

Harman, D. 2019. Information Retrieval: The Early Years. Foundations and Trends in Information Retrieval 13(5): 425-577.

Jones, K. S. 2004. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation 60(5): 493-502.

Robertson, S. 2008. On the History of Evaluation in IR. Journal of Information Science 34(4): 439-456.

Voorhees, E. M. 2001. The Philosophy of Information Retrieval Evaluation. Cross-Language Evaluation Forum (CLEF), Lecture Notes in Computer Science (LNCS) 2406: 355-370.