Precision and Recall

Precision and Recall are values with which the performance and effectiveness of an information retrieval (IR) system are evaluated in system-centric testing. Pure Precision and Recall, and their combined F-Measure, “are computed using unordered sets of documents” (Manning et al., 2008, p.145) of a heterogeneous nature. Test sets of documents and queries, such as the TREC series, have been developed for use with evaluative tools to standardize testing (Manning et al., 2008), a tradition started by Cleverdon, Mills & Keen in the Cranfield Experiments (Smucker, 2011). A large number of documents and queries in a test set is essential for stable and meaningful results (Buckley & Voorhees, 2000). The retrieval of non-relevant items, or the failure to retrieve relevant ones, is considered an error or defect in information system design (Araghi, 2005). High Precision and high Recall are the desired results.

Precision

Precision is defined as “the fraction of retrieved items that are relevant” out of the total documents retrieved by an IR system (Manning et al., 2008, p.142).

This value can be calculated with the equation:

Precision = (# of relevant and retrieved items) / (total # of retrieved items)

(Guns et al., 2012; Manning et al., 2008)

Variations on the Precision Value

Precision variations are used for ranked items in testing (Manning et al., 2008) and include:

i) Mean Average Precision (MAP): for a single query, Average Precision is “[t]he average of the precision value obtained for the set of top k documents existing after each relevant document is retrieved” (Manning et al., 2008, p.147); that is, the precision values at each relevant document are summed and divided by the total number of relevant documents, and MAP averages this value over all queries in the test set.

ii) Precision at k (a fixed cutoff): the evaluation of the Precision of a fixed number of top-ranked results. Buckley & Voorhees observed that this can be an “inherently unstable” value with a higher error rate that does not average well (2000, p.38; Manning et al., 2008).

iii) Interpolated Precision: the Precision reported at a given Recall level, taken as the highest Precision found at that or any higher Recall level (Manning et al., 2008).
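
The ranked measures above can be illustrated with a short sketch. The following Python snippet is a minimal, hypothetical example (not drawn from the cited sources): it computes Precision at k and Average Precision for one invented ranked result list, with relevance judgments represented as a set of document IDs; MAP would simply be the mean of average_precision over all test queries.

def precision_at_k(ranking, relevant, k):
    """Precision over the top k items of a ranked result list."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Sum the precision at each rank where a relevant document appears,
    then divide by the total number of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Hypothetical ranked list and relevance judgments for a single query.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d5"}

print(precision_at_k(ranking, relevant, 3))   # 2/3 = 0.67
print(average_precision(ranking, relevant))   # (1/1 + 2/3) / 3 = 0.56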

Recall

Recall is defined as “the fraction of relevant documents that are retrieved” (Manning et al., 2008, p.143) out of total relevant documents held in an information system.

This value can be calculated with the equation:

Recall = (# of relevant and retrieved items) / (total # of relevant items)

(Guns et al., 2012; Manning et al., 2008)
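
As a concrete illustration of these two set-based definitions, the following Python sketch (a hypothetical example, not taken from the cited sources) computes Precision and Recall from a set of retrieved document IDs and a set of relevance judgments.

def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of relevant items that are retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# Hypothetical retrieval run: 4 documents returned, 5 judged relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5", "d6", "d7"}

print(precision(retrieved, relevant))  # 2/4 = 0.5
print(recall(retrieved, relevant))     # 2/5 = 0.4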

The Precision and Recall Relationship

Information scientists have observed that an increase in either Precision or Recall tends to produce a corresponding decrease in the other variable, so their relationship is described as inverse, or complementary (Buckland & Gey, 1994; Cleverdon, 1972; Guns et al., 2012; Manning et al., 2008; Smucker, 2011). A Recall of 1.0 can be achieved trivially by retrieving every document in an information system; however, not all of those documents would be relevant to the information need, so Precision would be low. Conversely, a system that returns only the documents it is most confident about can achieve a Precision at or near 1.0, but it will miss many relevant documents, so Recall would be low (unless all documents were relevant) (Manning et al., 2008).
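
This trade-off can be seen by cutting one hypothetical ranked result list at increasing depths. The Python sketch below (an illustrative example, not from the cited sources) reports Precision and Recall when the system returns only the top k documents: as k grows, Recall rises while Precision falls.

def precision_recall_at_depth(ranking, relevant, k):
    """Precision and Recall when only the top k items are returned."""
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k, hits / len(relevant)

# Hypothetical ranking of 10 documents; 3 of them are judged relevant.
ranking = ["d1", "d4", "d2", "d8", "d3", "d9", "d6", "d5", "d7", "d10"]
relevant = {"d1", "d3", "d7"}

for k in (1, 5, 10):
    p, r = precision_recall_at_depth(ranking, relevant, k)
    print(f"top {k:2d}: precision={p:.2f} recall={r:.2f}")
# top  1: precision=1.00 recall=0.33
# top  5: precision=0.40 recall=0.67
# top 10: precision=0.30 recall=1.00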

The Role of Relevance

In lab testing, each item is judged as either relevant or not relevant to the information need; these binary judgments are a critical component of Precision and Recall values. Buckley & Voorhees identified Relevance as an integral part of a test collection, which “consists of a set of documents, a set of topics, and a set of relevance judgments…[t]he relevance judgments specify the documents that should be retrieved in response to each topic” (2000, p.34).

The binary relevance table:

Item             Relevant                      Not Relevant
Retrieved        tp (true positive match)      fp (false positive match)
Not Retrieved    fn (false negative match)     tn (true negative match)

A true positive (tp) is Relevant and retrieved; a false positive (fp) is retrieved but not Relevant.

A false negative (fn) is Relevant but not retrieved; a true negative (tn) is not Relevant and not retrieved.

The value of Precision is expressed as: P = tp / (tp + fp)

The value of Recall is expressed as: R = tp / (tp + fn)

(Manning et al., 2008, p.143)
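
For example, with hypothetical counts of tp = 30, fp = 10 and fn = 20 (values invented for illustration, not from the cited sources), Precision = 30 / (30 + 10) = 0.75 and Recall = 30 / (30 + 20) = 0.60.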

The F-Measure: Where Precision & Recall Meet

The F-Measure, or "F-Score", is “the harmonic mean of precision and recall” (Guns et al., 2012, p.1171).

Thus “if either precision or recall is low, F will be low as well” (Guns et al., 2012, p.1172).

This measure is represented by the equation:

"F = __2___=  _2PR_"

  1/P+1/R     P+R

(Guns et al., 2012, p.1171)
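
The following Python sketch (an illustrative example, not drawn from Guns et al.) computes the F-Measure from given Precision and Recall values and shows how a low value on either measure pulls F down.

def f_measure(p, r):
    """Harmonic mean of precision (p) and recall (r): F = 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# A high value on one measure cannot compensate for a low value on the other.
print(f_measure(0.9, 0.9))  # 0.90
print(f_measure(0.9, 0.1))  # 0.18

By contrast, the arithmetic mean of 0.9 and 0.1 would be 0.5, which is why the harmonic mean is used when a balance of the two measures is wanted (Manning et al., 2008).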

Challenges & Future Work

Researchers have agreed that there can be a large degree of uncertainty when assessing Precision and Recall in evaluation and practice. Contributing factors, and some responses to them, are discussed below:

Questions of Relevance

Because relevance judgments are integral to calculating Precision and Recall, there has been much controversy over the validity of these evaluative measures, owing to the instability of the term ‘relevance’ and the recognition of its subjectivity in practice (Lesk & Salton, 1968; Manning et al., 2008). In their study of 1200 documents, Lesk & Salton determined that “large scale differences in the relevance assessments do not produce significant variations in average recall and precision. It thus appears that properly computed recall and precision data may represent effectiveness indicators which are generally valid for many distinct user classes” (1968, p.343). However, many decades later, Manning et al. argued that, in practice, “judgments of relevance are subjective...[a]ny results based on one collection are heavily skewed by the choice of collection, queries, and relevance judgment set; the results may not translate from one domain to another or to a different user population” (2008, p.153). The controversy over the exactitude or flexibility of 'relevance' across IR systems in evaluation and practice continues to be debated.

Precision and Recall Values in Everyday Use

Precision and Recall values can be positively or negatively affected by different search strategies and by the design of some databases. For example, a balanced use of proximity operators (Keen, 1992, in Proximity Operators, 2021) and the supply of query suggestions in search programs have been shown to improve values, whereas other strategies, such as Boolean operators and the use of short words in queries, can negatively impact the Precision and Recall of search results. Furthermore, traditional Precision and Recall evaluations do not work well with all knowledge organization systems, such as folksonomies.

Indexing Strategies

In non-testing situations, users can be uncertain about how to express their information need and about how documents in a system have been indexed, or whether relevant documents are held at all. Researchers have therefore recommended better and more thorough indexing, which is critical to document retrieval, as an aid to the Precision-Recall relationship (Araghi, 2005; Cleverdon, 1972; Smucker, 2011). Araghi proposed utilizing terms familiar to users and ‘sufficient’ use of cross-indexing (2005, p.43). Furthermore, indexers should seek a balance of generality and specificity in term selection: terms that are too broad or ambiguous return high Recall with low Precision, while overly narrow terms result in low Recall and missed relevant items (Araghi, 2005; Smucker, 2011).

Holt & Miller (2009) explored the use of “[s]emantic indexing and entity recognition” (p.13), grouping or associating documents into “internally consistent sets of references...[a] particular set of references then describes an entity, and an entity is described by a set of references” (2009, p.13). Thus "[a] collection of entity information can be used as a statistical thesaurus for the purpose of query expansion” (2009, p.14), allowing one document to lead to a set that may contain what the searcher is actually looking for but is unable to express, and thereby aiding the Precision and Recall of their search.

References

Araghi, G.F. (2005). Major Problems in Retrieval Systems. Cataloging & Classification Quarterly 40(1), 43-53. https://doi.org/10.1300/J104v40n01_04

Buckland, M. & Gey, F. (1994). The Relationship between Recall and Precision. Journal of the American Society for Information Science 45(1), 12-19. Retrieved from https://search-proquest-com.ezproxy.library.ubc.ca/docview/216897072?accountid=14656&pq-origsite=summon

Buckley, C. & Voorhees, E.M. (2000). Evaluating Evaluation Measure Stability. In SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, 33-40. https://doi.org/10.1145/345508.345543

Cleverdon, C.W. (1972). On the inverse relationship of recall and precision. Journal of Documentation 28(3), 195-201. https://doi.org/10.1108/eb026538

Guns, R., Lioma, C. & Larsen, B. (2012). The tipping point: F-score as a function of the number of retrieved items. Information Processing & Management 48(6), 1171-1180. https://doi.org/10.1016/j.ipm.2012.02.009

Holt, J.D. & Miller, D.J. (2009). An evolution of search. Bulletin of the American Society for Information Science & Technology 36(1), 11-15. https://doi.org/10.1002/bult.2009.1720360105

Lesk, M.E. & Salton, G. (1968). Relevance assessments and retrieval system evaluation. Information Storage and Retrieval 4(4), 343-359. https://doi.org/10.1016/0020-0271(68)90029-6

Manning, C., Raghavan, P. & Schütze, H. (2008). Introduction to information retrieval. New York: Cambridge University Press.

Proximity Operators. (2021, February 23). In UBC Wiki. Retrieved March 2, 2021, from Course:LIBR557/2020WT2/proximity operators.

Smucker, M.D. (2011). Information representation. In Ruthven, I. & Kelly, D. (Eds.), Interactive Information Seeking, Behaviour and Retrieval (pp. 77-94). https://doi.org/10.29085/9781856049740