
Vector-based searching and embeddings

From UBC Wiki
Algorithms vs. artificial intelligence vs. machine learning vs. deep learning (Author: Johannes Vrana, CC BY-ND 4.0)


Introduction

Vector-based searching (also called semantic searching or embedding-based searching) is an information retrieval method that finds and ranks results by the semantic meaning of content rather than by matching the exact keywords or free text in the search query. It represents documents and queries as numerical vectors and retrieves results by measuring the similarity between those vectors, typically using a metric such as cosine similarity.

Traditional keyword searching (and even controlled vocabulary-driven searching) relies on lexical matching, meaning results are returned only when query terms explicitly appear in records. Vector-based searching, by contrast, encodes semantic meaning using machine learning models, enabling systems to retrieve relevant content even when different words, synonyms, or paraphrases are used. In vector-based systems, each document or query is converted into a list of numbers (an embedding), where each number captures some aspect of its meaning. Together these numbers place the item in a high-dimensional vector space, a mathematical way of representing meaning using many numerical features (dimensions). Items with similar meanings are positioned close together in this space, allowing retrieval based on conceptual similarity rather than literal text overlap.
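As a worked illustration of the cosine similarity mentioned above, the standard definition for two embeddings a and b with n dimensions is given below. The three-dimensional vectors in the example are hypothetical and chosen only to keep the arithmetic simple; real embeddings typically have hundreds or thousands of dimensions.

```latex
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
\]
% Hypothetical example: a = (1, 2, 2), b = (2, 1, 2)
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}{\sqrt{1 + 4 + 4} \, \sqrt{4 + 1 + 4}}
  = \frac{8}{3 \cdot 3}
  = \frac{8}{9} \approx 0.89
\]
```

A value close to 1 means the two vectors point in nearly the same direction (similar meaning), while a value close to 0 means they are largely unrelated.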

How vector-based searching works

Vector-based searching typically involves four stages:

1) Embedding

Content such as documents, sentences, images, or audio is converted into numerical representations called vector embeddings using a machine learning model. Queries are embedded using the same model.
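The embedding step can be illustrated with a deliberately simplified sketch. The `CONCEPTS` lexicon and the `embed` function below are hypothetical stand-ins for a trained machine learning model: a real encoder learns its associations from data rather than from a hand-written table. The sketch only shows the idea that differently worded texts can map to the same vector, and that documents and queries pass through the same function.

```python
from collections import Counter

# Hypothetical concept lexicon: each term maps to one dimension (concept).
# A real encoder model learns such associations from training data.
CONCEPTS = {
    "heart": 0, "cardiac": 0,
    "kidney": 1, "renal": 1,
    "disease": 2, "disorder": 2,
}
NUM_DIMS = 3


def embed(text: str) -> list[float]:
    """Convert text into a fixed-length vector of concept counts."""
    counts = Counter(
        CONCEPTS[token] for token in text.lower().split() if token in CONCEPTS
    )
    return [float(counts.get(dim, 0)) for dim in range(NUM_DIMS)]


# The document and the query share no words, yet receive the same vector
# because both are embedded by the same function.
document_vector = embed("renal disorder")
query_vector = embed("kidney disease")
other_vector = embed("cardiac disease")
```

Here `document_vector` and `query_vector` are identical even though the texts have no words in common, whereas `other_vector` differs in its first two dimensions.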

2) Indexing

Embeddings are stored in a specialized index or vector database. To enable fast retrieval at scale, systems commonly use Approximate Nearest Neighbor (ANN) algorithms.
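The entry does not say which ANN algorithm a given system uses. As one possible illustration, the sketch below uses random-hyperplane locality-sensitive hashing, a well-known ANN technique; the class name `HyperplaneIndex` and its parameters are assumptions made for this example, not the API of any particular vector database.

```python
import random
from collections import defaultdict


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class HyperplaneIndex:
    """Toy ANN index using random-hyperplane locality-sensitive hashing."""

    def __init__(self, dims: int, num_planes: int = 4, seed: int = 0):
        rng = random.Random(seed)
        # Each random hyperplane splits the vector space into two halves.
        self.planes = [
            [rng.gauss(0.0, 1.0) for _ in range(dims)]
            for _ in range(num_planes)
        ]
        self.buckets = defaultdict(list)

    def _signature(self, vector):
        # One bit per hyperplane: which side of the plane the vector is on.
        return tuple(dot(plane, vector) >= 0 for plane in self.planes)

    def add(self, doc_id, vector):
        self.buckets[self._signature(vector)].append((doc_id, vector))

    def candidates(self, query_vector):
        # Only the query's own bucket is inspected, so most stored vectors
        # are never compared. Near neighbours that landed in another bucket
        # are missed, which is why the result is approximate.
        bucket = self.buckets.get(self._signature(query_vector), [])
        return [doc_id for doc_id, _ in bucket]


index = HyperplaneIndex(dims=2)
index.add("doc_a", [1.0, 0.5])
same_direction = index.candidates([2.0, 1.0])    # same direction as doc_a
opposite = index.candidates([-1.0, -0.5])        # opposite direction
```

Vectors pointing in similar directions tend to share a bucket, so a query is compared only against a small candidate set. That trade of exactness for speed is the general idea behind ANN indexing.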

3) Querying

When a user submits a query, it is converted into a vector embedding using the same model that embedded the content.

4) Similarity matching

The system calculates the similarity between the query vector and the stored vectors using a metric such as cosine similarity.

Results are ranked by closeness in the vector space.
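The querying and similarity-matching stages can be sketched as a small exact (brute-force) search. The document identifiers and embedding values below are hypothetical, and the `search` function is an illustrative helper rather than any specific product's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def search(query_vector, stored_vectors, top_k=3):
    """Score every stored vector against the query and return the closest."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in stored_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:top_k]


# Hypothetical stored embeddings (in practice produced by an encoder model).
stored_vectors = {
    "doc_renal": [0.1, 0.9, 0.4],
    "doc_cardiac": [0.9, 0.1, 0.4],
    "doc_dental": [0.2, 0.2, 0.9],
}
query_vector = [0.0, 1.0, 0.5]
results = search(query_vector, stored_vectors, top_k=2)
```

With these values, `doc_renal` ranks first and `doc_dental` second, because their vectors point in directions closest to the query vector.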

Comparison with keyword search

| Feature | Keyword search | Vector-based search |
| --- | --- | --- |
| Matching method | Exact word (lexical) matching | Semantic similarity |
| Synonym handling | Limited | Strong |
| Sensitivity to phrasing | High | Low |
| Context awareness | Minimal | High |
| Multilingual capability | Limited | Often supported |

Models used

Vector-based searching typically relies on encoder models, which generate embeddings rather than text.

These models differ from large language models (LLMs), which are designed primarily for text generation rather than semantic encoding.

Applications

Vector-based searching is widely used in the following search and information retrieval systems:

  • Web search engines;
  • Academic and biomedical databases;
  • Retrieval-augmented generation (RAG) systems;
  • Recommendation systems;
  • Chatbots and question-answering systems;
  • Image and multimodal search.

Hybrid search (ensemble) approaches

Many modern search systems implement hybrid search techniques, combining vector-based (semantic) search with traditional keyword or Boolean search approaches. This ensemble method leverages the strengths of both: vector-based retrieval improves semantic recall by capturing meaning and synonymy, while keyword-based methods provide lexical precision, exact matching, and reliable filtering.

By integrating approaches, hybrid search systems can return results that are both contextually relevant and textually exact, improving overall retrieval quality in complex information environments such as academic databases, search platforms, and modern AI-powered search tools.
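The entry does not specify how the two result lists are combined. One commonly used option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method of any particular system. The document identifiers are hypothetical, and `k=60` is a frequently used default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both methods rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of the two retrieval methods for the same query.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]  # keyword or Boolean search
vector_ranking = ["doc_b", "doc_d", "doc_a"]   # vector-based search
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

In this example `doc_b` and `doc_a` lead the fused list because both methods retrieved them, while `doc_d` and `doc_c`, each found by only one method, follow. RRF uses only rank positions, so the keyword and vector scores do not need to be on the same scale.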

Advantages

  • Improved recall for semantically related content;
  • Robust handling of synonyms and paraphrases;
  • Better support for natural-language queries;
  • Cross-lingual and multimodal capabilities.

Limitations

  • Higher computational costs;
  • Reduced transparency compared to keyword matching (opaque, "black box" behaviour);
  • Potential semantic false positives;
  • Dependence on model quality and training data.

Librarian perspectives

Librarians tend to view vector-based searching as a pragmatic and generally positive development, but with some important cautions related to search transparency and reproducibility. An important emerging perspective is that hybrid search is an improvement over single-method retrieval because it combines:

  • Vector/semantic search, which helps surface relevant materials even when terminology differs (useful for interdisciplinary research, clinical synonyms, or evolving language)
  • Keyword/Boolean search, which preserves precision, explicit control, and repeatability

From a health and academic librarianship standpoint (e.g., biomedical databases such as PubMed/MEDLINE), our systems already combine controlled vocabularies (e.g., MeSH terms) with keyword searching. We see modern hybrid systems as an extension of longstanding retrieval principles rather than a completely new idea. Further, vector-based searching and embeddings are viewed as a useful but imperfect augmentation of traditional retrieval systems. The consensus is that vector-based searching works best when it is transparent, well documented, and paired with explicit search strategies, especially in research contexts such as systematic reviews, where rigour and reproducibility matter.

References

Note: I have read widely on this topic and will be populating this section with an extensive bibliography to support the entry. This is a complex topic, so thank you for your patience while I write this entry for librarians and information professionals. Some content was informed by the Wikipedia entry on vector databases (https://en.wikipedia.org/wiki/Vector_database), the Microsoft documentation at https://learn.microsoft.com/en-us/azure/cosmos-db/vector-database, and "What is vector search?" (https://learn.microsoft.com/en-us/training/modules/improve-search-results-vector-search/2-vector-search).

  • The use of semantic search with the help of vector databases has become an impressive paradigm of retrieving the pertinent information by offering the contextual and conceptual sense of the information searching more than using the conventional methods of keyword searching. This paper provides an in-depth overview of the models of vector representation, transformer-based semantic encoders, and technologies of vectors database that jointly allow efficient and error-free semantic search.
  • This paper investigates the enhancement of scientific literature chatbots through retrieval-augmented generation (RAG), with a focus on evaluating vector- and graph-based retrieval systems. The proposed chatbot leverages both structured (graph) and unstructured (vector) databases to access scientific articles and gray literature, enabling efficient triage of sources according to research objectives. To systematically assess performance, we examine two use-case scenarios: retrieval from a single uploaded document and retrieval from a large-scale corpus. Benchmark test sets were generated using a GPT model, with selected outputs annotated for evaluation. The comparative analysis emphasizes retrieval accuracy and response relevance, providing insight into the strengths and limitations of each approach. The findings demonstrate the potential of hybrid RAG systems to improve accessibility to scientific knowledge and to support evidence-based decision making.

Disclaimer

  • Note: Please use your critical reading skills while reading entries. No warranties, implied or actual, are granted for any health or medical search or AI information obtained while using these pages. Check with your librarian for more contextual, accurate information.