Course:LIBR557/2020WT2/proximity operators

Proximity Operators

Proximity operators are used in information retrieval to specify a distance between two query terms. A query with proximity operators will retrieve documents in which the terms appear within the specified distance of each other. The use and syntax of proximity operators vary by search system.

Examples of proximity operators that could be used in a query for the word "search" near the word "term" include:

search near5 term

search near/5 term

search SAME term

Proximity operators are a variant of the Boolean operator AND. All of the terms in a query that uses a proximity operator must co-occur in a document for it to be retrieved (Morton, 1993, p. 56). In an AND search, query terms that co-occur may be spread apart by any distance within a document.[1]

Proximity operators came into use as full-text indexing and indexing of longer document surrogates such as abstracts became common (Keen, 1992, p. 89). They are especially useful when searching full text, as they can be used “to tweeze out nuances and secondary topics that cannot possibly be covered by a limited number of subject headings,” according to Bell (2015, p. 52). Librarians and other advanced searchers use proximity operators when creating search strategies, particularly for systematic reviews.[2]

The underlying principle of using proximity operators is that the physical closeness of terms in a document correlates to the closeness of the terms’ subject relationship (Morton, 1993, p. 56). For example, a Boolean query for “coronavirus AND vaccine” would retrieve every document in which both terms appear, even if each appears only once and far apart: “vaccine” in the introduction, say, and “coronavirus” in the findings section. A proximity operator could instead impose a distance requirement, such as one that searches for “coronavirus” in the same paragraph as “vaccine,” or for “coronavirus” within two words of “vaccine.”
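The contrast between AND and a proximity operator can be sketched in a few lines of Python. This is only an illustration, not how any particular search system is implemented: the function names, the tokenizer, and the two sample texts are made up for the example, and distance is assumed to mean the difference between word positions (so adjacent words are at distance 1).

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_and(text, term_a, term_b):
    """Boolean AND: both terms occur somewhere in the document."""
    tokens = set(tokenize(text))
    return term_a in tokens and term_b in tokens


def matches_near(text, term_a, term_b, max_distance):
    """Proximity: the terms occur within max_distance positions, in either order."""
    tokens = tokenize(text)
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(i - j) <= max_distance
               for i in positions_a for j in positions_b)


# Hypothetical sample documents.
far_apart = ("Introduction: a vaccine schedule is reviewed for several diseases. "
             "Findings: one sample tested positive for coronavirus.")
close_together = "Trials of a vaccine for coronavirus began this year."

print(matches_and(far_apart, "coronavirus", "vaccine"))          # True
print(matches_near(far_apart, "coronavirus", "vaccine", 2))      # False
print(matches_near(close_together, "coronavirus", "vaccine", 2)) # True
```

Both documents satisfy AND, but only the second satisfies the distance requirement.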

Requiring only co-occurrence of query terms (as with AND) maximizes recall for that combination of terms, but may also retrieve many non-relevant records (Kostoff, Rigsby & Barth, 2006, p. 582). If used correctly, proximity operators can achieve a balance of precision and recall. However, as Keen (1992) points out, “if pursued too narrowly, recall will suffer, and if too broadly no improvement in precision will be experienced” (p. 90).
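The trade-off can be made concrete with a toy calculation. The four documents and their relevance labels below are invented for illustration, and distance is assumed to be the difference between word positions; the numbers say nothing about real collections.

```python
import re


def min_distance(text, term_a, term_b):
    """Smallest positional gap between the two terms, or None if either is absent."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    gaps = [abs(i - j) for i in pos_a for j in pos_b]
    return min(gaps) if gaps else None


def retrieve(docs, term_a, term_b, max_distance=None):
    """max_distance=None behaves like AND; otherwise like NEAR/max_distance."""
    hits = set()
    for doc_id, (text, _relevant) in docs.items():
        gap = min_distance(text, term_a, term_b)
        if gap is not None and (max_distance is None or gap <= max_distance):
            hits.add(doc_id)
    return hits


def precision_recall(retrieved, relevant):
    """Precision and recall; both are defined as 0.0 when their denominator is empty."""
    true_hits = len(retrieved & relevant)
    precision = true_hits / len(retrieved) if retrieved else 0.0
    recall = true_hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: doc_id -> (text, judged relevant to "coronavirus vaccines"?)
DOCS = {
    "d1": ("A vaccine for coronavirus entered trials.", True),
    "d2": ("The coronavirus vaccine was approved.", True),
    "d3": ("Researchers hope a vaccine against the new coronavirus arrives soon.", True),
    "d4": ("Flu vaccine uptake fell last winter according to the report. "
           "Separately, a coronavirus outbreak closed schools.", False),
}
RELEVANT = {doc_id for doc_id, (_text, relevant) in DOCS.items() if relevant}

for label, distance in [("AND", None), ("NEAR/1", 1), ("NEAR/4", 4), ("NEAR/15", 15)]:
    hits = retrieve(DOCS, "coronavirus", "vaccine", distance)
    print(label, sorted(hits), precision_recall(hits, RELEVANT))
```

In this toy collection, NEAR/1 keeps precision at 1.0 but recall drops to one third, NEAR/4 retrieves exactly the relevant documents, and NEAR/15 performs no better than AND, which mirrors Keen's warning about distances that are too narrow or too broad.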

Variant expressions

Proximity operators are used to retrieve documents that use varying expressions of a concept (Kostoff, Rigsby & Barth, 2006, p. 582). For example, while querying the phrase “coronavirus vaccine” might not retrieve documents that instead refer to “vaccine for coronavirus,” a proximity operator could be used to specify that the terms should occur within two words of each other, as in: coronavirus near/2 vaccine.

Names are often expressed with different variants, as with “Safiya Noble,” “Safiya Umoja Noble,” and “Safiya U. Noble.” A search for safiya near/2 noble would retrieve documents with all three variants.
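A short sketch shows why an exact phrase search misses some of these variants while a proximity search catches all of them. The helper names are hypothetical, punctuation is simply discarded by the assumed tokenizer, and distance is taken as the difference between word positions.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens; punctuation such as 'U.' is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_phrase(text, phrase):
    """Exact phrase: the phrase tokens appear consecutively and in order."""
    tokens, wanted = tokenize(text), tokenize(phrase)
    return any(tokens[i:i + len(wanted)] == wanted
               for i in range(len(tokens) - len(wanted) + 1))


def matches_near(text, term_a, term_b, max_distance):
    """Proximity: the two terms occur within max_distance positions, in either order."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


variants = ["Safiya Noble", "Safiya Umoja Noble", "Safiya U. Noble"]

print([matches_phrase(v, "safiya noble") for v in variants])
# [True, False, False]
print([matches_near(v, "safiya", "noble", 2) for v in variants])
# [True, True, True]
```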

Adjacency

Kostoff, Rigsby and Barth (2006) describe proximity searching as one of two types of “constrained co-occurrence searching” (p. 582).[3] They distinguish proximity from adjacency because with proximity, the order of words is not important, just the specified distance between them. Other sources treat adjacency operators, for which the order of search terms does matter, as a type of proximity operator.
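The difference between the two kinds of constraint comes down to whether word order is checked. The sketch below uses a single hypothetical function with an `ordered` flag; it assumes distance means the difference between word positions and is not modelled on any particular system's operators.

```python
import re


def within(text, first, second, max_distance, ordered=False):
    """Return True if the terms occur within max_distance positions of each other.

    ordered=False: proximity in the sense above (either order is accepted).
    ordered=True:  adjacency in the sense above (first must precede second).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    for i in first_positions:
        for j in second_positions:
            gap = j - i  # positive when `first` comes before `second`
            if ordered and 0 < gap <= max_distance:
                return True
            if not ordered and abs(gap) <= max_distance:
                return True
    return False


text = "a vaccine for coronavirus"
print(within(text, "coronavirus", "vaccine", 2))                # True
print(within(text, "coronavirus", "vaccine", 2, ordered=True))  # False
print(within(text, "vaccine", "coronavirus", 2, ordered=True))  # True
```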

Challenges

Proximity operators work differently in different search systems (Morton, 1993, p. 57), with variations in which proximity operators a system recognizes and how they are entered. Some systems include stop words in the distance. Users should refer to a system’s manual or advanced search to learn which proximity operators can be used and how to use them. Operators are sometimes referred to as “connectors,” as in Westlaw Edge (2019). In Boolean systems that allow multiple operators to be combined in a single search string, it is important to be aware of the order in which the system will process operators.
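Whether stop words count toward the distance can change what the same query retrieves. The following sketch makes that visible; the stop word list is a made-up example (real systems publish their own), and distance is again assumed to be the difference between word positions.

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "for", "of", "to", "in"}


def matches_near(text, term_a, term_b, max_distance, count_stop_words=True):
    """Proximity match; optionally drop stop words before measuring distance."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not count_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


text = "a vaccine for the coronavirus"

# Stop words counted: "vaccine" and "coronavirus" are three positions apart.
print(matches_near(text, "vaccine", "coronavirus", 1, count_stop_words=True))   # False
# Stop words skipped: the two terms become neighbours.
print(matches_near(text, "vaccine", "coronavirus", 1, count_stop_words=False))  # True
```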

Popular Indexes and Databases

Factiva

Factiva includes the proximity operators Near[x] and Same. They are not case sensitive, but must be separated from both search terms by white space. Here x is the distance between two terms. If no value is specified for x, the default is 1. Values from 1–500 are valid. Same will retrieve documents in which the two search terms appear in the same paragraph. Factiva also includes an adjacency operator: ADJx. If x is not specified, it is assumed to be 1, and 1–10 are valid distances for x (Factiva.com, n.d.).
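A same-paragraph operator of the kind described here can be approximated as follows. This is a rough sketch, not Factiva's implementation: it assumes paragraphs are separated by blank lines, and the function name and sample text are invented.

```python
import re


def same_paragraph(text, term_a, term_b):
    """Return True if some paragraph contains both terms (order and distance ignored)."""
    for paragraph in re.split(r"\n\s*\n", text):
        tokens = set(re.findall(r"[a-z0-9]+", paragraph.lower()))
        if term_a in tokens and term_b in tokens:
            return True
    return False


# Hypothetical two-paragraph documents.
split_across = "The vaccine rollout began in March.\n\nCoronavirus cases fell by June."
together = "Cases rose again.\n\nA new coronavirus variant may need an updated vaccine."

print(same_paragraph(split_across, "coronavirus", "vaccine"))  # False
print(same_paragraph(together, "coronavirus", "vaccine"))      # True
```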

JSTOR

In JSTOR, the tilde symbol ~ followed by the specified distance is used as a proximity operator when it follows two search terms (JSTOR, n.d.). The syntax looks like this: vaccine coronavirus ~4

Web of Science

Web of Science includes two proximity operators: NEAR/x and SAME. The x is the maximum distance between two terms. If no x value is specified, the system defaults to a distance of 15 words (Clarivate Analytics, 2020). A search with this proximity operator could look like:

coronavirus near/3 vaccine

(coronavirus or COVID-19) NEAR vaccine
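The default-distance rule can be illustrated with a small parser for the operator token alone. The function name and the regular expression are assumptions made for this sketch; only the default of 15 and the case-insensitive spellings come from the description and examples above.

```python
import re

DEFAULT_DISTANCE = 15  # default stated above for NEAR without an explicit x

# Hypothetical pattern: "near" optionally followed by "/<number>", any letter case.
NEAR_PATTERN = re.compile(r"near(?:/(\d+))?", re.IGNORECASE)


def parse_near(operator):
    """Return the maximum distance encoded in a NEAR or NEAR/x operator token."""
    match = NEAR_PATTERN.fullmatch(operator.strip())
    if match is None:
        raise ValueError(f"not a NEAR operator: {operator!r}")
    return int(match.group(1)) if match.group(1) else DEFAULT_DISTANCE


print(parse_near("near/3"))  # 3
print(parse_near("NEAR"))    # 15
```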

Westlaw Edge

In Westlaw Edge, /p can be used between two search terms to find them in the same paragraph, and /s to find two words in the same sentence. /x can be used between two terms to specify a distance between 1 and 255 (Westlaw Edge, 2019). For example: (lummi or lhaqthemish or lhaq'themish) /175 treat!
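Because each system spells the same request differently, a searcher translating a strategy between databases is effectively applying a lookup table. The sketch below encodes only the numeric-distance syntax and limits described in this section for Factiva, Web of Science, and Westlaw Edge; the dictionary keys, function name, and templates are hypothetical, and each system's own help pages remain the authority on valid syntax.

```python
# Syntax and distance limits as described in this section.
# A limit of None means no range is stated here.
SYSTEMS = {
    "factiva": {"template": "{a} near{n} {b}", "min": 1, "max": 500},
    "web_of_science": {"template": "{a} NEAR/{n} {b}", "min": None, "max": None},
    "westlaw_edge": {"template": "{a} /{n} {b}", "min": 1, "max": 255},
}


def build_proximity_query(system, term_a, term_b, distance):
    """Format a two-term proximity query for one of the systems listed above."""
    if system not in SYSTEMS:
        raise ValueError(f"unknown system: {system!r}")
    spec = SYSTEMS[system]
    if spec["min"] is not None and not spec["min"] <= distance <= spec["max"]:
        raise ValueError(
            f"{system} accepts distances from {spec['min']} to {spec['max']}"
        )
    return spec["template"].format(a=term_a, b=term_b, n=distance)


for name in SYSTEMS:
    print(name, "->", build_proximity_query(name, "coronavirus", "vaccine", 3))
```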

Beyond Boolean

Proximity as a scoring factor was not originally built into vector space or probabilistic models of information retrieval. Many empirical studies have looked at applying term proximity to relevance ranking algorithms. Tao and Zhai (2007) isolated proximity from other statistical properties to see its impact on relevance, and evaluated different measures of proximity distance to identify which correlate the most with document relevance. He, Huang, and Zhou (2011) found that using term proximity boosts the effectiveness of the classical probabilistic model. They found that the order in which query terms appear in a document is not important for relevance weighting, and that the ideal distance between query terms (in English-language documents) is 10 words or fewer. Uematsu, Inoue, Fujioka, Kataoka, and Ohwada (2008) went in a different direction, proposing the use of a sentence-based inverted index instead of indexing word position data, to measure proximity based on appearance in the same sentence.
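The general idea of a sentence-based inverted index can be sketched as follows. This is a simplified illustration and not the scoring method of Uematsu et al. (2008): the naive sentence splitter, the function names, and the use of a raw count of shared sentences as a proximity signal are all assumptions made for the example.

```python
import re
from collections import defaultdict


def build_sentence_index(docs):
    """Map each term to the set of (doc_id, sentence_number) pairs it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        for sentence_no, sentence in enumerate(sentences):
            for token in re.findall(r"[a-z0-9]+", sentence.lower()):
                index[token].add((doc_id, sentence_no))
    return index


def shared_sentence_counts(index, term_a, term_b):
    """For each document, count the sentences that contain both terms."""
    shared = index.get(term_a, set()) & index.get(term_b, set())
    counts = defaultdict(int)
    for doc_id, _sentence_no in shared:
        counts[doc_id] += 1
    return dict(counts)


# Hypothetical documents.
docs = {
    "d1": "A coronavirus vaccine was approved. "
          "The vaccine targets the coronavirus spike protein.",
    "d2": "The vaccine was delayed. A coronavirus outbreak followed.",
}
index = build_sentence_index(docs)
print(shared_sentence_counts(index, "coronavirus", "vaccine"))  # {'d1': 2}
```

No word positions are stored, so the index cannot answer "within three words," but it can answer "in the same sentence" with a simple set intersection.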

Bibliography

Bell, S. S. (2015). Librarian’s Guide to Online Searching: Cultivating Database Skills for Research and Instruction, 4th Edition. Libraries Unlimited.

Clarivate Analytics. (2020). Web of Science core collection help. https://images.webofknowledge.com/images/help/WOS/hs_search_operators.html#dsy862-TRS_proximity

Factiva.com. (n.d.). Search statement operators. http://factiva.com/CP_Developer/ProductHelp/FDK/FDK38/search_factiva/query_expressions/search_statement_operators.htm

He, B., Huang, J. X., & Zhou, X. (2011). Modeling term proximity for probabilistic information retrieval models. Information Sciences, 181(14), pp. 3017–3031. https://doi.org/10.1145/2590988

JSTOR. (n.d.). Searching: Truncation, wildcards and proximity. https://support.jstor.org/hc/en-us/articles/115012261448-Searching-Truncation-Wildcards-and-Proximity#proximity

Keen, E. M. (1992). Some aspects of proximity searching in text retrieval systems. Journal of Information Science, 18(2), pp. 89–98. DOI: 10.1177/016555159201800202

Kostoff, R., Rigsby, J., & Barth, R. (2006). Brief communication: Adjacency and proximity searching in the Science Citation Index and Google. Journal of Information Science, 32(6), pp. 581–587. DOI: 10.1177/0165551506067126

Morton, D. (1993). Refresher course: Getting next to proximity operators. Online, 17(6), pp. 56–58. ProQuest. Retrieved from https://ezproxy.library.ubc.ca/login?url=https://www-proquest-com.ezproxy.library.ubc.ca/trade-journals/refresher-course-getting-next-proximity-operators/docview/199913322/se-2?accountid=14656

Tao, T., & Zhai, C. (2007). An exploration of proximity measures in information retrieval. In W. Kraaij, A. de Vries, C. Clarke, N. Fuhr, & N. Kando (Eds.), Proceedings of the 30th Conference Special Interest Group on Information Retrieval. (pp. 295–302). ACM Digital Library. DOI: 10.1145/1277741.1277794  

Uematsu, Y., Inoue, T., Fujioka, K., Kataoka, R., & Ohwada, H. (2008). Proximity scoring using sentence-based inverted index for practical full-text search. In ECDL 2008: Research and Advanced Technology for Digital Libraries, pp. 308–319. International Conference on Theory and Practice of Digital Libraries. https://doi.org/10.1007/978-3-540-87599-4_33

Westlaw Edge. (2019). How to search with Boolean terms and connectors. Thomson Reuters. https://legal.thomsonreuters.com/content/dam/ewp-m/documents/legal/en/pdf/quick-reference-guides/tr906479-how-to-search-with-boolean-terms-and-connectors.pdf

Footnotes

  1. Some search systems include options to use operators within certain fields or sections of documents.
  2. For an example, see: Bramer, W. M., de Jonge, G. B., Rethlefsen, M. L., Mast, F., & Kleijnen, J. (2018). A systematic approach to searching: an efficient and complete method to develop literature searches. Journal of the Medical Library Association, 106(4), pp. 531–541. dx.doi.org/10.5195/jmla.2018.283
  3. Kostoff, Rigsby and Barth (2006) distinguish between two types of constrained co-occurrence searching: adjacency and proximity. In adjacency searching, the order of words is specified in addition to a fixed distance or range (p. 582).