Automated indexing
Introduction
Automated indexing has been defined as “indexing the subject content of papers by means of a computer, either with some human intervention and oversight, or none at all” (Giustini et al, 2025). In 2025, automated indexing can be performed using various computational methods, algorithms (hence, algorithmic indexing), natural language processing and even artificial intelligence (AI). Automated indexing can also refer to “semi- and/or partly automated” processes, depending on the level of human curation involved. According to Ruiz and Aronson (2009), automatic indexing is a form of text categorization in which machines assign terms from a controlled vocabulary to documents in order to summarize their contents.

Automated (or, semi-automated) compared to human indexing?
A commonly stated goal of state-of-the-art automated indexing is to mimic human indexing; its main challenge is to extract an exhaustive and precise set of terms, just as a human indexer would, to represent the subject content of every document in a database. By 2022, the National Library of Medicine (NLM) had implemented fully automated indexing in MEDLINE using the Medical Text Indexer (MTI-Auto), with human review for certain subjects and other records reviewed at random. Since 2002, several large-scale MeSH indexing approaches have been proposed to improve automated indexing, such as MeSHLabeler, DeepMeSH and MeSHProbeNet. However, the performance of these models is limited by their reliance on titles and abstracts; better results can be achieved using a paper's full text. NLM continues to evaluate innovative technologies to improve indexing performance, but new problems seem to arise as new medical concepts are introduced in biomedical papers.

What is automated indexing in MEDLINE?
Automated indexing in MEDLINE is sometimes referred to as algorithmic indexing (see Amar-Zifkin et al, 2025). In the MTI, algorithms are key to the indexing workflow at NLM.
In 2022, first-line indexing for all MEDLINE records was performed by the MTIA, with humans limiting their curation to sets involving genes and proteins. In 2025, the NLM uses the MTIX, which is based on neural network technology. According to the Encyclopedia of Knowledge Organization, “[algorithmic] indexing has referred historically to search-engines where automation plays an important role because of the scale of information”. Similarly, semantic indexing assumes that terms used in related documents tend to have similar subject content and meaning. Based on this assumption, associations between terms that occur in similar documents are calculated, and concepts for those documents are then extracted from a corpus. Indexing using semantic AI technologies is scalable, which is a major driver of its application in large web search engines. Further, algorithms have been shown to improve indexing consistency, although inconsistencies are not solved by applying automatic indexing methods alone. In other words, automated indexing is not an “objective” process, as it reflects the worldview of the texts it indexes and may perpetuate their specific perspectives and biases. A reliance on a large corpus of raw text to return outputs means that these algorithms suffer from their own indexing imprecision and unreliability.

Medical text indexer (MTI) and MEDLINE
The Medical Text Indexer (MTI) is the automated indexing tool developed by the National Library of Medicine (NLM) for MEDLINE. The MTI is one of the more impressive achievements for any national library, accomplished over decades of research, implementation and testing. In 2024, the MTIX (Medical Text Indexer-NeXt Generation) replaced the MTI-Auto and used machine learning and neural networks to assign MeSH terms to articles. Its main benefits were improved indexing speed and scalability.
Trained on millions of MEDLINE citations from 2007–2022, the MTIX analyzes titles, abstracts, and journal metadata to recommend MeSH terms with high recall (e.g., >94% for disease detection) and precision (e.g., 87% for disease categories). MTIX supports semi-automated and fully automated indexing, reducing the workload for human indexers while maintaining standards; still, its F-score of 0.90 suggests an error rate of roughly 10% (see Amar-Zifkin et al, 2025; Askin et al, 2025). Neural networks used in the MTIX enable rapid, precise indexing, which is critical for managing the growing volume of biomedical literature. In 2024, almost 1.4 million papers were added to MEDLINE. While human curation remains in place for quality control, NLM's use of AI supports applications such as the publicly available MeSH on Demand. For medical texts, NLM says that "... automated indexing is currently based on the title and abstract of articles; future work will investigate automated indexing based on processing of the article’s full text (where NLM has access to that text for computational purposes)", which could improve term coverage over title-and-abstract-based methods. Filtering techniques, like ranking scores and excluding lengthy documents, further boost accuracy. Transformers have significantly improved Medical Subject Headings (MeSH) indexing in MEDLINE, the core task of the Medical Text Indexer (MTI). Pretraining on in-domain biomedical text is critical: it gives the models the vocabulary, subword tokenization, and background knowledge needed to understand complex medical terminology. General-purpose models (like base BERT or even SciBERT) lack this knowledge and underperform on MeSH prediction. Since 2020, the National Library of Medicine (NLM) has incorporated BERT-based models (BioBERT, PubMedBERT) and later large language models into the MTI pipeline.
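The precision, recall and F-score figures quoted above are related by a simple formula, which the following minimal sketch illustrates. The function names and the sample heading sets are hypothetical and are not taken from NLM's evaluation code; combining the 87% precision and 94% recall figures is purely illustrative, since the text reports them for disease-related terms only.

```python
from typing import Iterable, Tuple


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean (F1)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


def term_level_scores(assigned: Iterable[str],
                      gold: Iterable[str]) -> Tuple[float, float, float]:
    """Compare machine-assigned headings with a human 'gold' set for one record."""
    assigned_set, gold_set = set(assigned), set(gold)
    true_positives = len(assigned_set & gold_set)
    precision = true_positives / len(assigned_set) if assigned_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall, f_score(precision, recall)


# Illustrative only: the precision and recall figures quoted in the text.
print(round(f_score(0.87, 0.94), 2))  # about 0.90

# Hypothetical record: 2 of 3 assigned headings are correct, 2 of 4 gold headings found.
machine = ["Myocardial Infarction", "Humans", "Aspirin"]
human = ["Myocardial Infarction", "Humans", "Risk Factors", "Middle Aged"]
print(term_level_scores(machine, human))
```

Because the F-score is a harmonic mean computed over assigned terms, an F-score of 0.90 is best read as a summary of term-level errors (wrong and missing headings) rather than as an exact percentage of flawed records.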
NLM's transformer models power the “First-Lines” and “Full-Text” predictors, dramatically boosting recall for rare MeSH terms and reducing the indexing workload for human revisers. State-of-the-art MTI performance now combines traditional approaches with transformer-based ranking (e.g., BERT rerankers and cross-encoders), achieving F-scores above 0.70. Despite these advancements to the MTIX, human indexers are still needed to correct and curate MEDLINE records.

Automated indexing from MTI (2002) to MTI-Auto (2022)
Rules-based systems such as the Medical Text Indexer – Automated (MTIA) use human-written instructions (i.e., "based on NLM policies, use most specific MeSH term in tree") and ask the underlying algorithm to follow them. In rules-based systems, the match and synonym rules are built automatically from the list, that is, "See XYZ, Use XYZ." For example, if a newly published paper contained the phrase “heart attack” in the title, the MTI's algorithm would assign the MeSH heading Myocardial Infarction. While precise, rules-based approaches are rigid, and newer terms, synonyms, or complex phrasing could cause the system to miss relevant MeSH terms. By 2024, machine learning systems using neural networks had emerged; the NLM implemented the MTIX, built from the data in millions of previously indexed records from 2007 to 2022. Instead of relying on fixed rules, the MTIX looked at linguistic patterns and adapted to new terminology, maintaining indexing precision while improving recall. The MTIX of 2024 replaced the MTIA (Auto) (2019, 2022) because the MTIA was a legacy “rules-based” system. Rules-based methods refer to previous algorithms (MTI, MTI-FL, MTIA) and their reliance on hand-crafted rules and heuristics, rather than learning from data found in MEDLINE citations. The rules-based MTI performed a range of pre-determined assignments according to MEDLINE indexing policies and directions embedded in see references and scope notes in the MeSH vocabulary.
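The rules-based matching described above can be sketched roughly as follows. This is a toy illustration and not NLM's implementation: the `ENTRY_TERMS` table and the `assign_headings` function are hypothetical, and the table holds only the two phrase-to-heading examples mentioned in this article plus simple variants.

```python
import re

# Hypothetical "See XYZ, Use XYZ" table: free-text phrase -> preferred MeSH heading.
ENTRY_TERMS = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
    "fracture of the hip": "Hip Fractures",
    "hip fracture": "Hip Fractures",
}


def assign_headings(text, entry_terms=ENTRY_TERMS):
    """Return the sorted headings whose trigger phrases appear in the text."""
    # Lowercase and collapse whitespace so matching is case-insensitive.
    normalized = re.sub(r"\s+", " ", text.lower())
    found = set()
    for phrase, heading in entry_terms.items():
        # Whole-phrase match, tolerating a simple plural "s".
        pattern = r"\b" + re.escape(phrase) + r"s?\b"
        if re.search(pattern, normalized):
            found.add(heading)
    return sorted(found)


print(assign_headings("Aspirin use after a heart attack: a cohort study"))
# ['Myocardial Infarction']

# An abbreviation that is not in the table is missed, which shows the rigidity of fixed rules.
print(assign_headings("Aspirin use after acute MI"))
# []
```

The second call shows why such systems need constant updating: any phrasing absent from the table yields no heading at all, whereas a model trained on previously indexed records can generalize from context.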
For example, the MTI matched exact keywords found in the titles and abstracts of articles to possible MeSH terms; identified patterns (e.g., if “fracture of the hip” appeared in the title or abstract, the MTI would assign “Hip Fractures” as a MeSH term); and used rules to decide relevance, such as word frequency thresholds, among other semantic techniques. Rules-based systems (2002-2022) worked well for two decades but became prone to errors, requiring constant updating and human intervention; the MTIA missed synonyms, misunderstood new terminologies, and made more work for indexers. As the biomedical literature grew and became more complex, the MTIX’s machine learning approach was evaluated as being better at handling a myriad of linguistic, semantic and other issues. AI-based indexing for MEDLINE scales up to the 1.5 million papers published annually. Still, human indexers amend records that have been assigned MeSH terms incorrectly. An F-score of 0.90 suggests that roughly 10% of term assignments are wrong or missing, so a share of records will need human curation at some point.

Key errors found in automated indexing records
The following list (incomplete) was created in an early analysis for a study entitled "Automated indexing of the biomedical literature in MEDLINE: a scoping review", and is based in part on comments from NLM's PubMed Office Hours in 2022-2024. In general, algorithmic indexing can perpetuate a range of biases along various dimensions such as gender, sexual orientation and race (however, more research is needed).
For more detail, see the Medical Library Association (MLA) 2025 presentation, Automated indexing in MEDLINE, and National Library of Medicine. NLM Medical Text Indexer. NLM Technical Bulletin. March-April 2024.

Questions re: impact on comprehensive searching
Health sciences librarians (HSLs) may want to consider the impact of automated indexing on their search practices and MEDLINE instruction. Given MTIX's AI features, HSLs may want to test more search filters and search strategies that combine MeSH and free-text terms to ensure comprehensive retrieval, particularly for recent or non-indexed literature. HSLs may also want to communicate the basics of automated indexing to users, share best practices at meetings, and explain the implications of automated indexing for search precision and recall in MEDLINE.
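As a minimal sketch of the strategy suggested above, the helper below assembles a PubMed search string that ORs a MeSH heading with free-text synonyms. The function name `build_pubmed_query` is hypothetical; `[MeSH Terms]` and `[Title/Abstract]` are PubMed's standard search field tags. The free-text clauses are what catch records that have not yet been indexed or whose automated indexing missed the heading.

```python
def build_pubmed_query(mesh_headings, free_text_terms):
    """Combine MeSH headings and free-text phrases into one OR'ed PubMed clause."""
    clauses = ['"{}"[MeSH Terms]'.format(heading) for heading in mesh_headings]
    clauses += ['"{}"[Title/Abstract]'.format(term) for term in free_text_terms]
    if not clauses:
        raise ValueError("at least one MeSH heading or free-text term is required")
    return "(" + " OR ".join(clauses) + ")"


query = build_pubmed_query(
    ["Myocardial Infarction"],
    ["heart attack", "myocardial infarction"],
)
print(query)
# ("Myocardial Infarction"[MeSH Terms] OR "heart attack"[Title/Abstract] OR "myocardial infarction"[Title/Abstract])
```

Several such concept blocks can then be joined with AND in the PubMed search box to build a full strategy.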
Feel free to share your comments and concerns: Dean Giustini, UBC Biomed librarian, dean.giustini@ubc.ca

