Bidirectional Encoder Representations from Transformers (BERT)
Bidirectional Encoder Representations from Transformers (BERT) is a natural language processing (NLP) model developed by Google in 2018. Built on the transformer architecture, BERT uses deep learning to model the context of words within text, enabling high performance on tasks such as text classification, sentiment analysis, and question answering. BERT has been widely adopted as a foundational baseline for many NLP applications, including Google Search, and has inspired numerous subsequent models.

Unlike earlier language models, BERT processes text bidirectionally, allowing it to consider both preceding and following words simultaneously. This full-context modeling improves semantic understanding and supports tasks that require precise interpretation of meaning. By contrast, GPT-style models are unidirectional, predicting the next word from left to right. While highly effective for text generation, this directional approach reflects a different design trade-off rather than a limitation, emphasizing generative fluency over contextual encoding.
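BERT’s masked-word pretraining objective offers a simple way to see this bidirectional behavior in practice. The sketch below is illustrative only: it assumes the Hugging Face transformers library and the publicly released bert-base-uncased checkpoint, neither of which is prescribed by this article.

```python
# Illustrative sketch: BERT predicts a masked word using context on BOTH sides.
# Assumes the Hugging Face `transformers` library and the public
# `bert-base-uncased` checkpoint (example choices, not mandated by this article).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Words before AND after [MASK] steer the prediction; here "deposit" and
# "account" (to the right of the gap) disambiguate the missing word.
for prediction in fill_mask("I went to the [MASK] to deposit money into my account."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```

A strictly left-to-right model would have to commit to a word before ever seeing "deposit" or "account"; BERT conditions on the entire sentence at once.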
Overview

BioBERT and PubMedBERT

BioBERT and PubMedBERT adapt BERT to the biomedical domain by pretraining on PubMed abstracts and full-text articles in PubMed Central. They generate embeddings that capture domain-specific terminology, acronyms, and conceptual relationships. Both models are resource intensive to train and deploy, consuming significant computational power and energy.

DistilBERT

DistilBERT is a distilled, compressed version of BERT that retains much of BERT’s semantic capability while reducing model size, inference time, and power requirements, offering a more energy-efficient alternative for large-scale indexing tasks.

SciBERT

SciBERT is a transformer-based language model pretrained on a large corpus of scientific literature from the biomedical and physical sciences. By learning domain-specific language patterns and terminology, it improves semantic similarity detection and document–concept matching, leading to more accurate automated indexing and information retrieval. However, pretraining SciBERT requires substantial high-performance computing resources, raising concerns about energy consumption and the environmental impact of large-scale language model development.
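As a rough illustration of how such embeddings support semantic similarity detection and document–concept matching, the sketch below mean-pools the token representations from a pretrained encoder and compares a query against two candidate passages with cosine similarity. It assumes the transformers and torch libraries and uses allenai/scibert_scivocab_uncased purely as an example checkpoint; any BERT-style model could be substituted, and this is not the pipeline of any particular indexing system.

```python
# Illustrative sketch: document-concept matching with BERT-style embeddings.
# Assumes the `transformers` and `torch` libraries; the checkpoint name is an
# example only (a DistilBERT checkpoint could be swapped in to cut compute costs).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "allenai/scibert_scivocab_uncased"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states into a single vector for one text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # shape: (dim,)

query = "drug interactions that raise the risk of bleeding"
candidates = [
    "Concurrent use of warfarin and aspirin increases hemorrhage risk.",
    "The library will be closed on public holidays.",
]

query_vec = embed(query)
for text in candidates:
    score = torch.nn.functional.cosine_similarity(query_vec, embed(text), dim=0)
    print(f"{score.item():.3f}  {text}")
```

Running a model like this over an entire catalogue repeats the inference cost for every record, which is one reason the compute and energy considerations discussed in the next section matter in practice.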
Environmental and climate impact

Training and deploying BERT-based models is computationally intensive, requiring substantial processing power, memory, and specialized hardware such as GPUs or TPUs. These demands translate into significant energy consumption, particularly during large-scale pretraining and repeated fine-tuning cycles. For libraries and research institutions, this raises important operational considerations, including infrastructure costs, system sustainability, and long-term maintenance. Environmental concerns are also increasingly relevant, as the carbon footprint of large language models can be considerable. Understanding these trade-offs allows librarians and information professionals to make informed decisions about adopting, evaluating, or relying on BERT-driven tools within discovery systems and research workflows.

Why should librarians care?

Librarians should care about BERT models because they directly affect how information is indexed, discovered, and retrieved, which are core professional concerns. BERT improves how systems understand language by analyzing words in context rather than as isolated terms. This shift mirrors how users actually search: with natural language queries, ambiguous phrasing, and complex research questions. Many discovery systems and search engines now rely on BERT-style models to rank results, extract concepts, and interpret queries. Traditional ideas about keyword matching, controlled vocabularies, and Boolean logic are increasingly supplemented, or even overridden, by contextual relevance scoring. Librarians who understand BERT can better explain why search results behave as they do, diagnose retrieval failures, and design more effective search strategies.

BERT also has implications for metadata, indexing, and bias. These models learn from large corpora that may underrepresent marginalized voices or reinforce dominant perspectives. Librarians’ expertise in collection development, ethical stewardship, and transparency is essential for interrogating how such models shape access to knowledge.

Finally, BERT underpins emerging tools used in automated indexing, summarization, and question answering. Librarians involved in research support, systematic reviews, and data services need to evaluate these tools critically, understanding both their efficiencies and limitations. In short, BERT models influence discovery infrastructure, user experience, and equity of access, making them highly relevant to contemporary librarianship.
