Automated indexing
Introduction
Automated indexing has been defined as “indexing the subject content of papers by means of a computer, either with some human intervention and oversight, or none at all” (Giustini et al., 2025). In 2025, automated indexing can be performed using various computational methods, including algorithms (hence, algorithmic indexing), natural language processing and artificial intelligence (AI). Automated indexing also covers “semi- and/or partly automated” processes, depending on the level of human curation involved. According to Ruiz and Aronson (2009), automatic indexing is a form of text categorization in which machines assign terms from a controlled vocabulary to documents in order to summarize their contents.

Automated (or semi-automated) compared to human indexing?
A commonly stated goal of state-of-the-art automated indexing is to mimic human indexing. The principal challenge, however, lies in extracting an exhaustive and precise set of controlled terms that accurately represent the subject content of each document in a database, as a human indexer would. Since 2002, several large-scale MeSH indexing approaches such as MeSHLabeler, DeepMeSH and MeSHProbeNet have been proposed to enhance automated indexing. However, the performance of these models is limited by their reliance on article titles and abstracts; improved results could be achieved by leveraging full-text content. While the National Library of Medicine (NLM) continues to evaluate innovative technologies to improve indexing performance, new challenges persist as novel medical concepts are introduced in the biomedical literature. In fiscal year 2021, the average time required to index articles fully reviewed by human indexers was 145 days, excluding the time needed for bibliographic data review. By 2022, the NLM had implemented a fully automated indexing program for MEDLINE using its Medical Text Indexer (MTI-Auto).
Under this system, human review is retained for selected subject areas, while other records are reviewed on a random basis. Automation has reduced indexing time considerably, to one business day.

What is automated indexing in MEDLINE?
Automated indexing in MEDLINE is sometimes referred to as algorithmic indexing (see Amar-Zifkin et al., 2025). Algorithms are central to the MTI and to the indexing workflow at NLM. In 2022, first-line indexing for all MEDLINE records was performed by the MTIA, with humans limiting their curation to sets involving genes and proteins. In 2025, the NLM uses the MTIX, which is based on neural network technology. According to the Encyclopedia of Knowledge Organization, “[algorithmic] indexing has referred historically to search-engines where automation plays an important role because of the scale of information”. Semantic indexing similarly assumes that terms used in related documents tend to have similar subject content and meaning. Based on this assumption, associations between terms that occur in similar documents are calculated, and concepts for those documents are then extracted from the corpus. Indexing with semantic AI technologies is scalable, which is a major driver of its application in large web search engines. Further, algorithms have been shown to improve indexing consistency, although inconsistencies are not solved by applying automatic indexing methods alone. In other words, automated indexing is not an “objective” process: it reflects the worldview of the texts it indexes, and may perpetuate its own specific perspectives and biases. Because these algorithms rely on a large corpus of raw text to produce their outputs, they suffer from their own indexing imprecision and unreliability.
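The idea that terms occurring in similar documents are associated can be illustrated with a minimal sketch. It assumes a hypothetical four-title toy corpus and uses plain cosine similarity over per-document term counts; the names `DOCS`, `term_vectors` and `cosine` are invented for illustration, and this is not a description of how MTI or MTIX computes associations.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each "document" is a short article title.
DOCS = [
    "myocardial infarction risk after heart surgery",
    "heart attack and myocardial infarction outcomes",
    "hip fracture repair in older adults",
    "fracture healing after hip surgery",
]


def term_vectors(docs):
    """Map each term to a vector of its counts in every document."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted({term for c in counts for term in c})
    return {term: [c[term] for c in counts] for term in vocab}


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = term_vectors(DOCS)

# Terms that appear in the same documents score close to 1;
# terms that never share a document score 0.
print(cosine(vectors["myocardial"], vectors["infarction"]))
print(cosine(vectors["heart"], vectors["hip"]))
```

In this toy corpus "myocardial" and "infarction" always co-occur, so they are treated as strongly associated, while "heart" and "hip" never share a document. The sketch also shows why such associations inherit the perspective of the corpus: a different set of documents would yield different associations.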
Medical text indexer (MTI) and MEDLINE
The Medical Text Indexer (MTI) is the automated indexing tool developed by the National Library of Medicine (NLM) for MEDLINE and represents one of the most significant achievements in large-scale automated indexing by a national library. Its development reflects decades of sustained research, implementation, and evaluation. In 2024, MTIX (Medical Text Indexer–NeXt Generation) replaced MTI-Auto, incorporating machine learning and neural network-based methods to assign MeSH terms to biomedical articles. The primary advantages of MTIX include substantially improved indexing speed and scalability. Trained on millions of MEDLINE citations published between 2007 and 2022, MTIX analyzes article titles, abstracts, and journal metadata to recommend MeSH terms with high recall (e.g., greater than 94% for disease detection) and strong precision (e.g., approximately 87% for disease categories). MTIX supports both semi-automated and fully automated indexing workflows, significantly reducing the workload for human indexers while maintaining indexing standards. Nevertheless, despite an overall F-score of 0.74, the system exhibits an estimated error rate of between one-third and one-half (Amar-Zifkin et al., 2025; Askin et al., 2025). Neural networks used in the MTIX enable rapid, precise indexing, which is critical for managing the growing volume of biomedical literature. In 2024, almost 1.4 million papers were added to MEDLINE. While human curation remains in place for quality control, NLM's use of AI supports applications such as the publicly available MeSH on Demand. For medical texts, NLM says that "... automated indexing is currently based on the title and abstract of articles; future work will investigate automated indexing based on processing of the article’s full text (where NLM has access to that text for computational purposes)", which could improve term coverage over title-and-abstract-based methods. Filtering techniques, such as ranking scores and excluding lengthy documents, further boost accuracy. Since 2020, NLM has incorporated models based on Bidirectional Encoder Representations from Transformers (BERT), such as BioBERT and PubMedBERT. These transformer models underpin the “First-Lines” and “Full-Text” predictors, substantially improving recall for rare MeSH terms and reducing the workload for human indexers. Pretraining on biomedical corpora is critical for these models, as it provides the specialized vocabulary, subword tokenization, and contextual knowledge required to interpret complex medical terminology. General-purpose models (e.g., BERT or SciBERT) lack these capabilities and therefore underperform in MeSH prediction tasks. State-of-the-art MTI performance now combines traditional indexing approaches with transformer-based ranking methods, such as BERT rerankers and cross-encoders, achieving F-scores above 0.70. Despite these and other advances incorporated into MTIX, human indexers remain essential for correcting errors and curating MEDLINE records to ensure indexing quality and consistency.

Automated indexing from MTI (2002) to MTI-Auto (2022)
Rules-based systems such as the Medical Text Indexer – Automated (MTIA) use human-written instructions (e.g., "based on NLM policies, use most specific MeSH term in tree") and ask the underlying algorithm to follow them. In rules-based systems, match and synonym rules are built automatically from the list, that is, from entries of the form "See XYZ, Use XYZ."
For example, if a newly published paper contained the phrase “heart attack” in the title, the MTI's algorithm would assign the MeSH heading Myocardial Infarction. While precise, rules-based approaches are rigid, and newer terms, synonyms, or complex phrasing could cause the system to miss relevant MeSH headings. By 2024, machine learning systems using neural networks had emerged; the NLM implemented the MTIX, built from the data in millions of previously indexed records from 2007 to 2022. Instead of relying on fixed rules, the MTIX looked at linguistic patterns and adapted to new terminology, maintaining indexing precision while improving recall. Rules-based systems (2002-2022) worked well for two decades but became prone to errors that required constant updating and human intervention; the MTIA missed synonyms, misunderstood new terminology, and made more work for indexers. As the biomedical literature grew and became more complex, the machine learning approach was judged better at handling a wide range of linguistic, semantic and other issues. AI-based indexing for MEDLINE now scales up to the 1.5 million papers published annually. Still, human indexers amend records that have been assigned MeSH terms incorrectly.

MTIX of 2024
MTIX, introduced in 2024, replaced MTIA (Auto) (2019, 2022), which was a legacy rules-based system. Rules-based methods, including earlier versions such as MTI, MTI-FL, and MTIA, relied on hand-crafted rules and heuristics rather than learning directly from data in MEDLINE citations. These systems applied predetermined assignments based on MEDLINE indexing policies, as well as directives embedded in see references and scope notes within the MeSH vocabulary. For example, MTI matched exact keywords in article titles and abstracts to candidate MeSH terms and applied pattern-based rules (e.g., assigning the MeSH term Hip Fractures when phrases such as “fracture of the hip” appeared).
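The keyword and pattern rules described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical two-entry synonym table and two regular-expression rules; the names `SYNONYM_RULES`, `PATTERN_RULES` and `assign_headings` are invented here and do not reproduce NLM's actual rule set or code.

```python
import re

# Hypothetical "See X, Use Y" synonym rules: phrase -> MeSH heading.
SYNONYM_RULES = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
}

# Hypothetical pattern rules covering simple phrasing variants.
PATTERN_RULES = [
    (re.compile(r"\bfractures? of the hip\b"), "Hip Fractures"),
    (re.compile(r"\bhip fractures?\b"), "Hip Fractures"),
]


def assign_headings(text):
    """Return the sorted MeSH headings whose rules fire on the text."""
    lowered = text.lower()
    found = set()
    for phrase, heading in SYNONYM_RULES.items():
        # Whole-phrase match only; anything not listed is missed.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(heading)
    for pattern, heading in PATTERN_RULES:
        if pattern.search(lowered):
            found.add(heading)
    return sorted(found)


print(assign_headings("Outcomes after heart attack in patients with fracture of the hip"))
print(assign_headings("Acute coronary occlusion in adults"))
```

The first title triggers both rules, while the second returns no headings at all. No rule in this sketch covers its phrasing, which mirrors the rigidity attributed to rules-based indexing in the text.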
Additional rules were used to assess relevance, including word-frequency thresholds and other heuristic semantic techniques. By contrast, MTIX employs data-driven, machine learning-based methods that have dramatically improved indexing efficiency. The MTIX also leverages neural network-based models to learn complex semantic relationships between biomedical text and Medical Subject Headings (MeSH), enabling more accurate and scalable indexing than earlier rules-based systems. By training on millions of MEDLINE citations, these neural architectures capture contextual meaning and synonymy that cannot be encoded through hand-crafted rules. As a result, MTIX achieves faster indexing turnaround while maintaining high recall and precision across diverse biomedical domains. As of 2026, article citations are typically indexed within one day of receipt in NLM’s indexing system. In practical terms, most articles from MEDLINE-indexed journals now appear in PubMed with assigned MeSH terms within one business day (see https://www.nlm.nih.gov/bsd/indexfaq.html#descriptor).

Key errors found in automated indexing records
The following list was created in an early analysis for “Automated indexing of the biomedical literature in MEDLINE: a scoping review”, and is based in part on comments from NLM's PubMed Office Hours in 2022-2024. In general, algorithmic indexing can perpetuate a range of biases along various dimensions such as gender, sexual orientation and race (however, more research is needed).
For more detail, see the Medical Library Association (MLA) 2025 presentation, “Automated indexing in MEDLINE”, and National Library of Medicine, “NLM Medical Text Indexer”, NLM Technical Bulletin, March-April 2024.

Questions re: impact on comprehensive searching
Health sciences librarians (HSLs) may wish to consider how automated indexing is reshaping search practices and MEDLINE instruction. Understanding MTIX and its AI-driven features suggests a growing need to test and refine search strategies that combine MeSH and free-text terms to ensure comprehensive retrieval, particularly for very recent, partially indexed, or non-indexed literature. HSLs may also play an important role in communicating the fundamentals of automated indexing to users, sharing emerging best practices with colleagues, and explaining the implications of these changes for search precision and recall in MEDLINE. This raises several questions for practice and professional reflection:
Feel free to share your comments, experiences, and concerns.

Dean Giustini
UBC Biomedical Librarian
dean.giustini@ubc.ca
