
Semantic searching

From UBC Wiki
Source: Hierarchical AI Graphic from Preisler, 2024, pg.6.

Introduction

Semantic searching is an information retrieval method that uses artificial intelligence (AI) and natural language processing (NLP) to interpret the meaning and context of the words in a search query, rather than matching exact words, terms, or phrases as lexical searching does. Semantic searching is increasingly paired with AI-powered systems and large language models (LLMs) to recognize synonyms, abbreviations, related concepts, and clinical terminology. This allows a clinician's query to be understood more fully and documents to be retrieved based on meaning, not just keywords.

Many modern information retrieval systems on the web use keywords to find related documents, matching documents against the terms in a query. Traditional keyword searching finds occurrences of words and phrases within a corpus of searchable documents or websites. Semantic searching, by contrast, aims to understand the context in which a searcher uses those words. For a search on heart attack, keyword searching returns documents containing the words “heart” and “attack,” while a semantic search seeks out deeper associations: it can also return results that contain the terms “myocardial infarction,” “acute coronary syndrome,” and “cardiac ischemia,” even if the phrase “heart attack” is not present. Where a keyword search misses documents because authors use different or related terms, a semantic search should, at least in theory, produce a more complete set of results.
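The heart attack example can be sketched in a few lines of Python. This is only an illustration: the three document titles and the hand-written RELATED_TERMS table are invented here, and a real semantic search system would learn such associations from a language model or draw them from a vocabulary such as MeSH rather than from a small dictionary.

```python
# Toy corpus: titles are invented for illustration only.
DOCS = {
    1: "Heart attack risk factors in older adults",
    2: "Outcomes after acute myocardial infarction",
    3: "Management of cardiac ischemia in primary care",
}

# Hypothetical concept table standing in for what a semantic system would learn.
RELATED_TERMS = {
    "heart attack": [
        "myocardial infarction",
        "acute coronary syndrome",
        "cardiac ischemia",
    ],
}


def keyword_search(query, docs):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    return [
        doc_id
        for doc_id, text in docs.items()
        if all(w in text.lower().split() for w in words)
    ]


def expanded_search(query, docs):
    """Return ids of documents containing the query phrase or a related phrase."""
    phrases = [query.lower()] + RELATED_TERMS.get(query.lower(), [])
    return [
        doc_id
        for doc_id, text in docs.items()
        if any(p in text.lower() for p in phrases)
    ]


print(keyword_search("heart attack", DOCS))   # [1]
print(expanded_search("heart attack", DOCS))  # [1, 2, 3]
```

The keyword search finds only the title that literally contains both words, while the expanded search also finds the two titles that describe the same concept in clinical terms.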

Lexical vs. semantic vs. vector searching

  • Lexical searching, semantic searching, and vector searching are related but distinct approaches to information retrieval.
  • They differ in how they process queries and match them to relevant information. Lexical searching focuses on matching exact words or phrases in a user’s query with those found in a corpus of records. This approach excels in speed, transparency, and precision, particularly when searching for known items, specific terminology, or structured data.
  • Most bibliographic databases licensed by libraries—such as MEDLINE and EMBASE—have traditionally relied on lexical approaches, supported by controlled vocabularies of subject headings and index terms. Historically, controlled terms were applied by human indexers to describe the subject content of articles, though indexing is now increasingly automated or semi-automated. While highly effective for precision searching, lexical methods can struggle when a user’s information need is complex, poorly articulated, uses unfamiliar terminology, or involves concepts not well represented in the controlled vocabulary.
  • Semantic searching aims to address these limitations by focusing on meaning rather than exact term matching. Using natural language processing and other AI techniques, semantic searching attempts to understand context, intent, and relationships among concepts. One early method, explicit semantic analysis, represents documents as vectors of concepts derived from knowledge bases, mapping content into a conceptual space rather than relying solely on keywords or headings.
  • Vector-based searching with embeddings is a more recent and influential implementation of semantic search. It represents queries and documents as numerical embeddings in a high-dimensional vector space, allowing retrieval based on similarity of meaning rather than shared vocabulary. Vector search can surface relevant documents even when they have no obvious lexical overlap with a query, making it effective for exploratory searching and natural language queries.
  • Unlike traditional lexical searching, vector-based systems are often opaque: relevance ranking is difficult to explain, results may vary over time as models are updated, and searches are not easily reproducible. These characteristics raise concerns for systematic searching, transparency, and auditability—especially in evidence-based disciplines.
  • As Tay says, embedding-based vector search may be one of the least objectionable uses of AI in search precisely because it complements, rather than replaces, traditional lexical methods. When used as an assistive layer—supporting discovery while leaving structured, transparent search strategies intact—vector search can enhance recall without undermining the methodological rigor required in scholarly and systematic searching.
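The contrast between lexical and vector retrieval, and the idea of using vector search as an assistive layer alongside lexical search, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors are hand-assigned rather than produced by an embedding model, the titles are invented, and reciprocal rank fusion (with the conventional constant k = 60) is just one common way to combine two rankings, not a method named by the sources above.

```python
import math

# Hand-assigned toy "embeddings" (hypothetical values, not model output).
# The dimensions loosely stand for: [cardiac, acute event, diet].
DOC_VECTORS = {
    "Outcomes after acute myocardial infarction": [0.9, 0.8, 0.0],
    "Heart attack risk factors in older adults": [0.8, 0.9, 0.1],
    "Mediterranean diet and weight loss": [0.1, 0.0, 0.9],
}
QUERY_TEXT = "heart attack"
QUERY_VECTOR = [0.85, 0.85, 0.0]  # hypothetical embedding of the query


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_rank(query_vec, doc_vectors):
    """Rank all documents by cosine similarity to the query vector."""
    return sorted(
        doc_vectors,
        key=lambda d: cosine(query_vec, doc_vectors[d]),
        reverse=True,
    )


def lexical_rank(query, docs):
    """Rank documents by number of query words present; drop non-matches."""
    words = query.lower().split()
    scored = [(sum(w in d.lower().split() for w in words), d) for d in docs]
    matched = [(s, d) for s, d in scored if s > 0]
    return [d for s, d in sorted(matched, key=lambda pair: -pair[0])]


def reciprocal_rank_fusion(rankings, k=60):
    """Combine several rankings; each document scores sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


lexical = lexical_rank(QUERY_TEXT, DOC_VECTORS)
semantic = vector_rank(QUERY_VECTOR, DOC_VECTORS)
hybrid = reciprocal_rank_fusion([lexical, semantic])

print(lexical)   # only the title containing "heart" and "attack"
print(semantic)  # myocardial infarction title first, diet title last
print(hybrid)    # heart attack title first, then myocardial infarction
```

In this toy setup the lexical ranking returns a single document, the vector ranking also surfaces the myocardial infarction title despite no shared words, and the fused ranking keeps the exact lexical match on top while adding the semantically related document beneath it.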

Note: Search engines such as Google Scholar rely largely on keyword matching, whereas AI tools such as Elicit.com, Semantic Scholar, and Undermind.ai use semantic understanding to interpret natural language queries and find conceptually relevant papers.

References

  • Authors found that "... Consensus, Evidence Hunt, Lens.org, and Semantic Scholar were the most useful tools, having a ranking of 9 out of 10. Elicit.com, Litmaps, OpenAlex, and Scinapse closely followed with 8 out of 10".
  • "... retrieval augmented generation (RAG) is an approach to infuse a private knowledge base of documents with large language models (LLMs) to build Generative Q&A (Question-Answering) systems. However, RAG accuracy becomes increasingly challenging as the corpus of documents scales up, with Retrievers playing an outsized role in the overall RAG accuracy by extracting the most relevant document from the corpus to provide context to the LLM. In this paper, we propose the ‘Blended RAG’ method of leveraging semantic search techniques, such as Dense Vector indexes and Sparse Encoder indexes, blended with hybrid query strategies. Our study achieves better retrieval results and sets new benchmarks for IR (Information Retrieval) datasets like NQ and TREC-COVID datasets. We extend a ‘Blended Retriever’ to the RAG system to demonstrate superior results on Generative Q&A datasets like SQUAD, even surpassing fine-tuning performance."

Disclaimer

  • Note: Please use your critical reading skills while reading entries. No warranties, implied or actual, are granted for any health or medical search or AI information obtained while using these pages. Check with your librarian for more contextual, accurate information.