
Semantic searching

From UBC Wiki
Source: Hierarchical AI Graphic from Preisler, 2024, pg.6.

Introduction

Semantic searching is an information retrieval method that uses artificial intelligence (AI) and natural language processing (NLP) to interpret the meaning and context of the words in a search query, rather than matching exact words, terms, or phrases as lexical searching does. Semantic searching is increasingly paired with AI-powered systems and large language models (LLMs) that recognize synonyms, abbreviations, related concepts, and clinical terminology. This allows a clinician's query to be understood more fully and documents to be retrieved based on meaning, not just keywords.

Many modern information retrieval systems on the web use keywords to find documents, matching them against the terms in a query. Traditional keyword searching finds occurrences of words and phrases within a corpus of searchable documents or websites. In contrast, semantic searching aims to understand the context of those words as the searcher uses them. For example, in a search for papers about heart attack, keyword searching returns documents containing the words “heart” and “attack,” while a semantic search seeks out deeper associations: it can also return results that contain the terms “myocardial infarction,” “acute coronary syndrome,” and “cardiac ischemia,” even if the phrase “heart attack” is not present. Where a keyword search misses relevant documents because authors used different or related terms, a semantic search should produce a more complete result, at least in theory!
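The heart attack example can be made concrete with a small sketch. The Python below is illustrative only: the document titles are invented, and the hand-built `CONCEPTS` table is a hypothetical stand-in for the associations a real semantic system would learn from data or draw from a thesaurus. The function names `keyword_search` and `concept_search` are likewise assumptions made for this example.

```python
import re

# Toy corpus: hypothetical document titles, invented for illustration only.
DOCS = [
    "Management of acute myocardial infarction in older adults",
    "Heart attack warning signs in women",
    "Panic attack and heart rate variability",
    "Cardiac ischemia detected by stress testing",
]

# Hand-built concept table (hypothetical). A real semantic search system
# would learn such associations or take them from a controlled vocabulary.
CONCEPTS = {
    "heart attack": [
        "heart attack",
        "myocardial infarction",
        "acute coronary syndrome",
        "cardiac ischemia",
    ],
}


def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_search(query, docs):
    """Return documents that contain every word of the query, in any order."""
    wanted = tokens(query)
    return [d for d in docs if wanted <= tokens(d)]


def concept_search(query, docs):
    """Return documents containing any phrase listed for the query's concept."""
    phrases = CONCEPTS.get(query.lower(), [query.lower()])
    return [d for d in docs if any(p in d.lower() for p in phrases)]


if __name__ == "__main__":
    print("keyword:", keyword_search("heart attack", DOCS))
    print("concept:", concept_search("heart attack", DOCS))
```

In this toy corpus the keyword search returns the panic attack title, because both words occur in it, and misses the myocardial infarction and cardiac ischemia titles. The concept-based search does the reverse. A phrase lookup table is far cruder than what AI-powered tools do, but it shows why matching on meaning can change both what is found and what is excluded.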

Lexical vs. semantic searching

Lexical searching and semantic searching are two distinct approaches to information retrieval that differ in how they process queries and match them to relevant information. As mentioned, lexical search approaches focus on matching the exact words and phrases in a query with those in a corpus of records. Lexical searching offers speed and precision when dealing with specific terms or structured data. Semantic searching, on the other hand, is better at handling natural language queries, understanding context, and exploring related concepts.

Most bibliographic databases licensed by libraries, such as MEDLINE and EMBASE, use lexical approaches (although they increasingly incorporate aspects of AI). These databases are structured using a controlled vocabulary of subject headings or "index" terms. Historically, human indexers applied controlled terms or subject headings to describe the subject content of papers in a database; this process is now automated or semi-automated. The drawback of lexical searching is that the context surrounding a user’s underlying need may be complex, absent from the index, or carry meanings that a given query does not represent. Semantic searching aims to resolve this challenge. One method in particular, explicit semantic analysis, represents a document's content as a weighted vector of concepts drawn from a knowledge base. While similar to subject indexing, semantic searching differs in that it is boosted by natural language processing and other AI techniques. The drawback of semantic searching is that searchers often cannot see what is going on under the hood of AI systems, and searches may not be reproducible because of the dynamic nature of those systems.
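One way to picture concept-based matching is to score each piece of text against a set of concepts and then compare the resulting score vectors. The sketch below is a deliberately simplified illustration, not an implementation of any particular system: the three-entry `CONCEPT_TEXTS` knowledge base is hypothetical, plain word counts stand in for the weighting a real system would use, and the helper names are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical miniature knowledge base: concept name -> descriptive text.
CONCEPT_TEXTS = {
    "Myocardial infarction": "heart attack myocardial infarction coronary artery blocked chest pain",
    "Stroke": "stroke brain blood clot artery speech weakness",
    "Diabetes": "diabetes insulin glucose blood sugar pancreas",
}


def words(text):
    """Lowercase the text and return its alphabetic words as a list."""
    return re.findall(r"[a-z]+", text.lower())


# Word counts for each concept description, computed once.
CONCEPT_COUNTS = {name: Counter(words(t)) for name, t in CONCEPT_TEXTS.items()}


def concept_vector(text):
    """Score the text against each concept by summing counts of shared words."""
    text_words = words(text)
    return [sum(counts[w] for w in text_words) for counts in CONCEPT_COUNTS.values()]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    query = concept_vector("heart attack")
    doc_a = concept_vector("blocked coronary artery causing chest pain")
    doc_b = concept_vector("insulin dosing and glucose monitoring")
    print("query vs doc_a:", round(cosine(query, doc_a), 2))
    print("query vs doc_b:", round(cosine(query, doc_b), 2))
```

Here the query and the first document share no words at all, yet both score most strongly on the same concept, so their vectors point in nearly the same direction. The second document scores only on an unrelated concept and has zero similarity to the query. The sketch also hints at the transparency problem: the ranking depends entirely on a knowledge base and weighting that the searcher never sees.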

Both have their strengths: lexical for pinpoint accuracy, semantic for broader, context-aware exploration.

Note: Search engines such as Google Scholar rely mainly on keyword matching, whereas AI tools such as Elicit.com, Semantic Scholar, and Undermind.ai use semantic understanding to interpret natural language queries and find conceptually relevant papers.

References

  • "..retrieval augmented generation (RAG) is an approach to infuse a private knowledge base of documents with large language models (LLMs) to build Generative Q&A (Question-Answering) systems. However, RAG accuracy becomes increasingly challenging as the corpus of documents scales up, with Retrievers playing an outsized role in the overall RAG accuracy by extracting the most relevant document from the corpus to provide context to the LLM. In this paper, we propose the ‘Blended RAG’ method of leveraging semantic search techniques, such as Dense Vector indexes and Sparse Encoder indexes, blended with hybrid query strategies. Our study achieves better retrieval results and sets new benchmarks for IR (Information Retrieval) datasets like NQ and TREC-COVID datasets. We extend a ‘Blended Retriever’ to the RAG system to demonstrate superior results on Generative Q&A datasets like SQUAD, even surpassing fine-tuning performance."

Disclaimer

  • Note: Please use your critical reading skills while reading entries. No warranties, implied or actual, are granted for any health or medical search or AI information obtained while using these pages. Check with your librarian for more contextual, accurate information.