Note: Any discussion of AI geared towards librarians should start with a look at the ethical, legal, institutional, and strategic concerns many librarians have about AI. Talk to your colleagues or librarian about these concerns so you can make informed decisions.
Scope
The aim of this entry is to examine papers that evaluate the use of AI search tools in comprehensive searching, literature reviews, and knowledge synthesis. Part of this discussion concerns research assistants (or chatbots) that merge AI-driven capabilities with traditional library database functionality, letting researchers quickly and efficiently locate articles on a topic while also providing summaries and evaluations of those articles. It all seems too good to be true, so what does the evidence tell us so far?
My research questions (RQs) are: 1) What are the most popular and researched AI-powered search tools? and 2) How are they currently used or deployed within knowledge synthesis (KS) activities? (Comments in bold are the author’s; published, peer-reviewed papers in red are a good read and well-conducted.) There are summaries for ~40 papers, and a list of chatbots, links, and websites at the end. Note that the information on this page can change, so please check each tool's website for the most current information (or discuss with a librarian).
Background
In June 2025, the following paper was published which looks at AI in health and medical libraries generally:
Sen S. AI and generative AI in health and medical libraries: a scoping review of present use and emerging potential. Journal of EAHIL. 2025 Jun 10;21(2). https://ojs.eahil.eu/JEAHIL/article/view/675 The paper includes systematic searching and AI; "...the evidence strongly suggests that AI should serve as an augmentative tool rather than a replacement for librarians."
For a 2024 discussion of evaluating AI in searching to support KS, see:
Canada’s Drug Agency. Development of an Evaluation Instrument on Artificial Intelligence Search Tools for Evidence Synthesis. Canadian Journal of Health Technologies. 2024 Oct 18;4(10). https://www.canjhealthtechnol.ca/index.php/cjht/article/view/AI0001/2190 This 2024 paper discusses automation technologies, including generative AI, that assist with search tasks for evidence syntheses; it already seems incomplete given the rapid progress in AI-powered searching in 2025. Still, its AI Search Tool Evaluation Instrument can be adapted.
Introduction
The AI search space (i.e., using AI technologies and tools to conduct searches) is fast-moving and highly volatile in mid-2025.
In medicine, a number of tools are being tested to perform literature searches and synthesize papers. For example, Elicit.com and Open Evidence (which performs as a type of point-of-care tool and is backed by physicians) are two typical emerging tools.
Elicit is perhaps the most-studied platform. In early 2025, Elicit.com released its AI-powered search and synthesis tool, initially called “Start A Systematic Review”; it is now simply a channel called “Systematic Reviews”. The papers using Elicit as a comparator (i.e., overviews, case studies, and reviews with head-to-head comparisons) are the most robust studies we have so far.
All of these papers are discussed in the body of this document. Since the collation of these papers during my study leave, I have had the pleasure of meeting with the developer(s) at Elicit.com to learn more about their models and service. I have a lot of questions about its application for our work in supporting faculty and researchers at UBC.
Park (2025) examines seven (7) tools, including Elicit, focusing mostly on search features but covering other elements as well;
Bernard (2025), a case study comparing Elicit against a single human-conducted umbrella review;
Bolanos (2024), a primer that includes Elicit and other tools;
Dukic (2025) evaluates SciSpace vs. Elicit in assisting with reviews;
Meliante (2024): clinicians evaluated Scite and Elicit to search for articles on “Glaucoma, pseudoexfoliation and Hearing Loss”, comparing results with a human-conducted PRISMA-based SLR;
Seth (2025): clinicians conducted a three-way comparison of AI search engines (Elicit, Consensus, ChatGPT) vs. manual searching for literature retrieval, focusing on osteoarthritis;
Spillias (undated) tested GPT-4 Turbo and Elicit;
Williamson (2025) evaluates SciSpace, Semantic Scholar, Elicit, Google Scholar, Research Rabbit, PubMed, and CAB Abstracts. A veterinary topic was selected to test the success of AI tools in searching for academic sources: the authors searched for scholarly literature on colic AND horses AND microbiome in each of the AI tools (SciSpace, Semantic Scholar, Elicit, Research Rabbit, and Google Scholar) and databases (PubMed and CAB Abstracts).
Dean: This study was led by a librarian; aspects of it are well done. The authors capitalize on advances in large language models (LLMs) and a large dataset of natural language descriptions of reviews and corresponding Boolean searches to generate search queries from SR titles. They used a training dataset of 10,346 SR search queries registered in PROSPERO to fine-tune models to generate search queries, and evaluated the models using a dataset of N=57 SRs and via semi-structured interviews with 8 experienced medical librarians. The model-generated search queries had a sensitivity of 85% (interquartile range [IQR] 40%-100%) and a number needed to read of 1206 citations (IQR 205-5810). Dean: Their model lacks precision but may be useful for topic scoping or as initial queries to be refined. See also: Adam GP, Davies M, George J, Caputo E, Htun JM, Coppola EL, Holmer H, Kuhn E, Wethington H, Ivlev I, Balk EM, Trikalinos TA. Machine Learning Tools To (Semi-)Automate Evidence Synthesis: A Rapid Review and Evidence. Rockville (MD): Agency for Healthcare Research and Quality (US); 2024 Dec. Report No.: 25-EHC006. Dean: "...tools, particularly for automatically identifying RCTs and prioritizing relevant abstracts in screening, show a high level of recall and precision, suggesting they are useful when used with human oversight. Other tools, such as those for searching and data extraction, show highly variable performance and are not yet reliable enough for semi-automation."
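To make the precision problem concrete: number needed to read (NNR) is the reciprocal of precision, so an NNR of 1206 implies that roughly one in every 1200 screened citations is relevant. A minimal arithmetic sketch using the medians reported above:

```python
# Number needed to read (NNR) = citations screened per relevant citation
# found, i.e., the reciprocal of precision.
median_nnr = 1206              # median NNR reported in the study above
implied_precision = 1 / median_nnr
print(f"Implied precision: {implied_precision:.4%}")  # ~0.0829%

median_sensitivity = 0.85      # median sensitivity (recall) reported above
# High recall with an NNR over 1000 means the generated queries find most
# relevant studies but bury them in a very large screening set.
```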
2. Alaniz L, Vu C, Pfaff MJ. The utility of artificial intelligence for systematic reviews and Boolean query formulation and translation. Plastic and Reconstructive Surgery–Global Open. 2023 Oct 1;11(10):e5339. https://pubmed.ncbi.nlm.nih.gov/37908326/
The formulation of precise search strings and syntax is key to conducting a successful systematic review. Using ChatGPT, the researchers test a system that harnesses natural language processing to generate search strings and Boolean queries. (Dean: the paper doesn't consider the shortcomings of LLMs and ChatGPT. The authors do not understand controlled vs. free-text searching and do not consider errors made by ChatGPT. They also don't consider that LLM training datasets have a cut-off date, e.g., 2021, and are not current.)
3. Bernard N, Sagawa Y Jr, Bier N, Lihoreau T, Pazart L, Tannou T. Using artificial intelligence for systematic review: the example of Elicit. BMC Med Res Methodol. 2025 Mar 18;25(1):75. https://pmc.ncbi.nlm.nih.gov/articles/PMC11921719/
Case study. Examines whether Elicit adds value to the systematic review process compared to traditional screening methods. The authors compare results from one (1) umbrella review with results of AI-based searching using the same criteria. Articles obtained with Elicit were reviewed using the same inclusion criteria as the umbrella review, and reliability was assessed by comparing the number of publications against the human-conducted studies. Findings suggest AI research assistants like Elicit serve as valuable complementary tools when designing or writing SRs. However, AI tools have several limitations and should be used with caution; certain principles must be followed to maintain methodological rigour and integrity. Improving the performance of AI tools such as Elicit and contributing to the development of guidelines for their use during the systematic review process will enhance their effectiveness. (Dean: this method might be helpful for research. Elicit can be used throughout the entire systematic review process to assist researchers with specific tasks. See also: Elicit AI Review: Your Best Research Tool to Use in 2025 https://www.fahimai.com/elicit-ai)
4. Blasingame MN, Koonce TY, Williams AM, Giuse DA, Su J, Krump PA, Giuse NB. Evaluating a Large Language Model's Ability to Answer Clinicians' Requests for Evidence Summaries. medRxiv [Preprint]. 2024 May 3:2024.05.01.24306691. https://pubmed.ncbi.nlm.nih.gov/38746273/
Investigates the performance of GPT-4 in answering clinical questions versus medical librarians’ gold-standard evidence syntheses. Questions were extracted from an in-house database of clinical evidence requests answered by medical librarians. A standard prompt was developed using the COSTAR framework. Librarians submitted each question into aiChat, an internal chat tool using GPT-4, and recorded the responses. The aiChat summaries were evaluated on whether they contained the critical elements used in the librarian’s established gold-standard summary. A subset of questions was randomly selected for verification of the references provided by aiChat. Of 216 evaluated questions, aiChat’s response was assessed as “correct” for 180 (83.3%) questions, “partially correct” for 35 (16.2%) questions, and “incorrect” for 1 (0.5%) question. For a subset of 30% (n=66) of questions, 162 references were provided in the aiChat summaries, of which 60 (37%) were confirmed as not fabricated. (Dean: the performance of a generative AI tool was promising. However, many included references could not be verified, and the study did not assess whether any additional concepts introduced by aiChat were factually accurate.)
5. Bolanos F, Salatino A, Osborne F, Motta E. Artificial intelligence for literature reviews: Opportunities and challenges. Artificial Intelligence Review. 2024 Aug 17;57(10):259. https://link.springer.com/article/10.1007/s10462-024-10902-3
Dean: This is a good read. The authors review the use of AI in literature reviews, mostly in terms of screening papers. Numerous tools have been developed to assist and partially automate SRs, and the increasing role of AI in this field shows great potential in supporting researchers, moving towards semi-automatic creation of literature reviews. The paper focuses on how AI techniques are applied in the semi-automation of SLRs, particularly the screening and extraction phases. It examines 21 leading SLR tools using a framework that combines 23 traditional features with 11 AI features, and also analyses 11 recent tools that leverage large language models for searching the literature and assisting academic writing. Dean: see section 7.1 for Elicit information.
Dean: Librarians looked at whether an LLM, ChatGPT, could generate a systematic review search strategy. Mixed results; any review published using only AI search strategies would clearly have major issues. "...Given the time-savings may not materialize due to the need for expert oversight of model output and possible impact on people and population’s health, taken in concert with the poor reproducibility and recall found in these results, how much autonomy are we willing to bestow upon these tools?"
Dean: This 2024 paper examines automating the process of comprehensive searching in biomedical databases to obtain all relevant articles on a given topic. It introduces a dataset developed to facilitate automated searching techniques, provides and analyzes a set of baseline methods using a number of generative models, and reports their results. It proposes a simple but effective ChatGPT-based model for generating Boolean queries in PubMed; the model is more effective than the baseline search models, than keyword searching in PubMed, and than existing methods for crafting Boolean queries using ChatGPT. Dean: the authors show that the model is more effective than manual queries in terms of precision and recall but falls short of the recall that manual queries achieve at position 1000.
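For readers unfamiliar with the basic technique, here is a minimal, hypothetical sketch of prompting an LLM to draft a PubMed Boolean query, assuming the OpenAI Python client; the model name, prompt wording, and topic are illustrative assumptions, not this paper's pipeline, and (per the cautions throughout this page) any generated query needs expert checking before use:

```python
# Hypothetical sketch: ask an LLM to draft a PubMed Boolean query from a
# review topic. Model name, prompt wording, and topic are assumptions, not
# the paper's method; outputs can include MeSH headings that do not exist
# and must be verified by a searcher before use.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

topic = "remote ischaemic preconditioning for preventing acute kidney injury"
prompt = (
    "Write a PubMed Boolean search query for a systematic review on the "
    f"topic: {topic}. Combine MeSH terms and free-text synonyms with OR "
    "within each concept and AND between concepts. Return only the query."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```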
8. Chen XS, Feng Y. Exploring the use of generative artificial intelligence in systematic searching: a comparative case study of a human librarian, ChatGPT-4 and ChatGPT-4 Turbo. IFLA Journal. 2024 Jul 21:03400352241263532. https://journals.sagepub.com/doi/10.1177/03400352241263532
Uses a comparative case study approach to examine the search-term-generation and article-retrieval capabilities of a human librarian, ChatGPT-4, and a customized ChatGPT-4 Turbo AI-Librarian Bot. The study advocates for a synergistic model where AI augments the systematic-review process, complementing the depth and nuance provided by human expertise in achieving accurate and comprehensive research outcomes. (Dean: a small case study that frames ChatGPT as a supplement to other tools. I like the phrase “human librarian”.)
9. Đukić M, Škembarević M, Jejić O, Luković I. Towards the Utilization of AI-Powered Assistance for Systematic Literature Review. In: European Conference on Advances in Databases and Information Systems 2025 (pp. 195-205). Springer, Cham. https://link.springer.com/chapter/10.1007/978-3-031-70421-5_16
Evaluates the efficacy of two GPT-based tools, Elicit and SciSpace, in automating the SLR process. The authors assess their usability and reliability in two SLR steps: literature search and citation screening, examining the benefits (accuracy? efficiency? a single search tool and interface; time-savings) and limitations (reproducibility) of these tools. Elicit and SciSpace offer significant assistance in literature retrieval and study selection, but their integration with human expertise is essential to ensure the thoroughness and accuracy of the review process.
Dean: this paper has just been published; once I have read it over, I will make comments. It's a novel study for the tools it examines: Lens.org, SpiderCite, and Microsoft Copilot. Canada's Drug Agency conducted a research project involving a literature review, comparative analysis, and focus group on AI or automation tools for information retrieval. Retrieval practices at CDA served as the reference standard, and eligible studies from 7 completed projects were used to measure tool performance. For searches conducted with their usual approaches and with each of the 3 tools, they calculated sensitivity/recall, number needed to read (NNR), time to search and screen, unique contributions, and the likely impact of the unique contributions on the projects’ findings. The investigation confirmed that AI search tools have inconsistent and variable performance across the range of information retrieval tasks performed at Canada's Drug Agency. Implementation recommendations from this study informed a “fit for purpose” approach where Information Specialists leverage AI search tools for specific tasks or project types.
11. Fenske RF, Otts JA. Incorporating Generative AI to Promote Inquiry-Based Learning: Comparing Elicit AI Research Assistant to PubMed and CINAHL Complete. Medical Reference Services Quarterly. 2024 Oct 1;43(4):292-305. https://pubmed.ncbi.nlm.nih.gov/39495550/
Dean: This study identified an effective strategy to integrate GenAI into course design to promote inquiry-based learning (IBL). A descriptive study design was used with graduate nursing students to compare the effectiveness of a GenAI literature search tool, Elicit: The AI Research Assistant, to PubMed and CINAHL. Students identified the strengths (pros) and weaknesses (cons) of each tool and determined which tool was more effective in terms of accuracy, relevance, and efficiency. Dean: this study is relevant to our teaching of AI search tools.
12. Friesen E, Roy A. Better than a Google Search? Effectiveness of Generative AI Chatbots as Information Seeking Tools in Law, Health Sciences, and Library and Information Sciences. SSRN preprint. 2025 Jul 31. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5402185
Generative AI chatbots are reshaping information-seeking behaviours due to their ability to cite online sources in responses to user queries. University students are turning to chatbots as learning partners; perceived trust in these tools speaks to the importance of the quality of the sources cited when they are used as an information retrieval system. This study investigates the source citation practices of five widely available chatbots - ChatGPT, Copilot, DeepSeek, Gemini, and Perplexity - across three academic disciplines: law, health sciences, and library and information sciences. Using 30 discipline-specific prompts grounded in the respective professional competency frameworks, the study evaluates source types, organizational affiliations, the accessibility of sources, and publication dates. Results reveal major differences between chatbots, which cite consistently different numbers of sources (Perplexity and DeepSeek citing more and Copilot fewer), and between disciplines: health sciences questions yield more scholarly source citations, while law questions are more likely to yield blog and professional website citations. Paywalled sources and discipline-specific literature such as case law or systematic reviews are rarely retrieved. These findings highlight inconsistencies in chatbot citation practices and suggest discipline-specific limitations that challenge their reliability as academic search tools.
Dean: Elicit did not search with high enough sensitivity to replace traditional literature searching. However, the high precision of searching in Elicit could prove useful for preliminary searches, and the unique studies identified mean that Elicit can be used by researchers as a useful adjunct. Further evaluations should be undertaken as new developments take place.
14. Guimarães NS, Joviano-Santos JV, Reis MG, Chaves RR, Observatory of Epidemiology, Nutrition, Health Research (OPENS). Development of search strategies for systematic reviews in health using ChatGPT: a critical analysis. Journal of Translational Medicine. 2024 Jan 2;22(1):1. https://link.springer.com/article/10.1186/s12967-023-04371-5
Discusses a pilot study of search strategies created by ChatGPT. The search strategies created by ChatGPT did not correctly organize the groups of acronyms within the same search key. (Dean’s note: the authors recommend caution when building information search strategies using ChatGPT exclusively. Although the tool is simple to run and quick to respond, content and structuring problems were reported, and searchers should be aware.)
"...We applied AI tools—Scite and Undermind—in the context of a realist review to facilitate the identification of relevant studies. Seed papers and key informant papers guided the search, and a novel classification system (grandparent, parent, and child papers) was used to systematically organise studies for developing and refining theoretical constructs. Transparent screening procedures and decision-making frameworks were employed to ensure methodological rigour and reproducibility..."
16. Jin Q, Leaman R, Lu Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine. 2024;100:104988. doi:10.1016/j.ebiom.2024.104988 https://pubmed.ncbi.nlm.nih.gov/38306900/
Authors present 30 literature search tools tailored to common biomedical use cases, describing AI-based search tools catering to five specific information needs: 1. Evidence-based medicine. 2. Precision medicine and genomics. 3. Searching by meaning, including questions. 4. Finding related articles with literature recommendations. 5. Discovering hidden associations through literature mining. The authors also discuss the impact of LLMs such as ChatGPT on biomedical information seeking. It can be hard for users to find a suitable tool to efficiently fulfill their information needs; future work should utilise AI techniques, especially large language models, to automatically triage users’ information needs and point them to the right tool.
16. Khraisha Q, Put S, Kappenberg J, Warraitch A, Hadfield K. Can large language models replace humans in systematic reviews? Evaluating GPT‐4's efficacy in screening and extracting data from peer‐reviewed and grey literature in multiple languages. Research Synthesis Methods. 2024 Jul;15(4):616-26. https://onlinelibrary.wiley.com/doi/abs/10.1002/jrsm.1715
Dean: Pre-registered study using a “human-out-of-the-loop” approach to evaluate GPT-4's capability in title/abstract screening, full-text review, and data extraction across various literature types and languages. Although GPT-4 had accuracy on par with human performance in some tasks, results were skewed by chance agreement and dataset imbalance. Adjusting for these caused performance scores to drop across all stages: for data extraction, performance was moderate, and for screening, it ranged from none in highly balanced literature datasets (~1:1) to moderate in datasets where the ratio of inclusion to exclusion was imbalanced (~1:3). Dean: in screening full-text literature using highly reliable prompts, GPT-4's performance was more robust, reaching “human-like” levels. (Dean: caution should be exercised if LLMs are used to conduct systematic reviews; still, the authors offer some evidence that, for certain review tasks delivered under specific conditions, GPT-4 may be helpful.)
17. Levay P, Craven J. Systematic Searching in a Post-Pandemic World: New Directions for Methods, Technology, and People. Evidence Based Library and Information Practice. 2023;18(4):93-104.
Dean: another very dated paper already, but worth a read for background. Can we use artificial intelligence (AI) to generate search strategies? Text-generation systems are already being rolled out to question-answering services in familiar search engines, such as Bing and Google. We are now seeing attempts to apply generative AI to evidence synthesis, with mixed results (Qureshi et al., 2023). ChatGPT-3.5, launched in November 2022, can generate seemingly plausible PubMed strategies if prompted with the right question (Wang et al., 2023). (Dean: the authors point out that AI strategies won't pass peer review, as they can contain serious errors, such as subject headings that do not actually exist in MeSH (Wang et al., 2023).)
18. Lieberum J-L, Töws M, Metzendorf M-I, Heilmeyer F, Siemens W, Haverkamp C, Böhringer D, Meerpohl JJ, Eisele-Metzger A. Large language models for conducting systematic reviews: on the rise, but not yet ready for use – a scoping review. Journal of Clinical Epidemiology (2025). https://www.jclinepi.com/article/S0895-4356(25)00079-4/fulltext
Dean: This is an excellent paper. "...LLMs should be used with caution and limited to specific SR tasks under human supervision. Despite the apparent technical simplicity of implementing LLMs in literature searching, their evaluation – authors rated half of the approaches as non-promising – seems to show a range of limitations. LLMs seem to support in study selection and data extraction and appeared more favorable, with by far most (study selection) or at least a slight majority (data extraction) of the described application forms rated as promising, and the rest categorized as neutral." Dean: a key paper delivered by outstanding librarian colleagues. SR literature searching was the most frequent SR step aided by LLMs (n=15; 41% of 37 papers), yet most of these searching approaches (n=8) were rated as non-promising.
Dean: This is another clinician-led study. It tested Scite and Elicit to search for articles on “Glaucoma, pseudoexfoliation and Hearing Loss”, comparing results with a previously human-conducted SLR. The authors then used Elicit and ChatPDF to assess their capability to extract and organize key information from scientific articles. This conference paper highlights the potential and limitations of AI in SLR development. Neither Scite nor Elicit was able to provide the results found by the human-based PRISMA method. Elicit’s ability to process and summarize information in tables is time-saving, though accuracy is not assured. ChatPDF gives reliable information and may be beneficial for SLR writing. Note: participation of human researchers remains crucial to maintain control over the quality, accuracy, and objectivity of the work.
Canada’s Drug Agency (CDA-AMC) developed a process to evaluate AI search tools. It inventoried 51 tools in 2023, established selection criteria, assessed specific attributes, and built an instrument to support monitoring. The rapid development of AI search requires an instrument to inform adoption and enable comparison; this work produced a flexible instrument to continually evaluate novel AI search tools for evidence synthesis. (Dean: no mention of Open Evidence https://www.openevidence.com/ or Undermind https://www.undermind.ai/home/)
Large language models (LLMs) are transforming the way evidence is retrieved by converting natural language prompts into quick, synthesized outputs. These platforms reduce the time required for literature searches, making them more accessible to users unfamiliar with formal search strategies. An evaluation of four prominent platforms - Undermind.ai, Scite.ai, Consensus.app, and Open Evidence - highlights advantages and ongoing limitations. Undermind and Consensus utilize the extensive Semantic Scholar database of over 200 million records, Scite enhances results with “Smart Citations” that indicate supportive or opposing references, and Open Evidence applies a medically focused LLM trained on licensed sources, including the complete NEJM archive. Despite their benefits, key limitations persist: opaque algorithms, inconsistent responses to identical queries, paywalls or sign-up barriers, and incomplete recall that may compromise systematic reviews. To support critical appraisal, the authors outline essential information-retrieval metrics - including recall, precision, F1-score, mean average precision, and specificity - and provide open-source code. Until validated, transparent evaluations demonstrate consistently high recall, these tools should be viewed as rapid, first-pass aids rather than replacements for the structured database searches required by PRISMA-compliant methodologies. (Dean’s note: the authors rightly point out that until transparent, reproducible search logs and consistently high recall can be guaranteed, researchers should treat LLM-based outputs as a helpful first pass, followed by conventional database searches to ensure completeness, compliance with PRISMA, and scientific integrity.)
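To illustrate the metrics named above, here is a minimal sketch (not the authors' published open-source code) computing recall, precision, and F1 from sets of record identifiers; specificity would additionally require the total number of screened records:

```python
# Minimal sketch of the core retrieval metrics discussed above, computed
# over sets of record identifiers (e.g., PMIDs or DOIs). Illustrative only.
def retrieval_metrics(retrieved: set[str], relevant: set[str]) -> dict[str, float]:
    true_positives = len(retrieved & relevant)
    recall = true_positives / len(relevant) if relevant else 0.0       # sensitivity
    precision = true_positives / len(retrieved) if retrieved else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

# Example: a tool's output vs. the gold-standard includes of a completed review.
gold = {"pmid:111", "pmid:222", "pmid:333", "pmid:444"}
tool = {"pmid:111", "pmid:222", "pmid:999"}
print(retrieval_metrics(tool, gold))  # recall 0.50, precision ~0.67, F1 ~0.57
```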
22. Orgeolet L, Foulquier N, Misery L, Redou P, Pers JO, Devauchelle-Pensec V, Saraux A. Can artificial intelligence replace manual search for systematic literature? Review on cutaneous manifestations in primary Sjögren's syndrome. Rheumatology (Oxford). 2020 Apr 1;59(4):811-819. https://academic.oup.com/rheumatology/article/59/4/811/5557823
Compares manual searching with searching using the in-house computer software BIbliography BOT (BIBOT), designed for article retrieval and analysis. In the final selection of 202 articles, 155/202 (77%) were found by both methods, but BIBOT was faster and automatically classified the articles in a chart. Combining the two methods retrieved the largest number of publications. (Dean’s note: combining manual and AI searching seems optimal.)
23. Parisi V, Sutton A. The role of ChatGPT in developing systematic literature searches: an evidence summary. Journal of EAHIL. 2024 Jun 27;20(2):30-4. https://ojs.eahil.eu/JEAHIL/article/view/623
Dean: Two librarians summarize whether ChatGPT is acceptable for developing systematic searches. The paper explores the potential and limitations of using ChatGPT for developing systematic literature searches. A search identified current peer-reviewed and grey literature; studies were selected according to eligibility criteria, then analyzed and synthesized with a focus on the strengths, limitations, and recommendations for using ChatGPT to assist with systematic literature searching. The literature is mostly opinion-driven, with limited published literature originating from the library and information profession. (Dean’s note: At present, the limitations outweigh the strengths of ChatGPT for systematic literature searching. See also: Sutton A, Parisi V. ChatGPT: Game-changer or wildcard for systematic searching? Health Info Libr J. 2024;41(1):1-3.)
24. Park SG. AI and Systematic Reviews: Can AI Tools Replace Librarians in the Systematic Search Process?. Science & Technology Libraries. 2025 Jun 26:1-22.
(Dean’s note: This is a comprehensive examination of seven AI tools, including Consensus and Elicit. A bit long, but the conclusions are correct: AI tools are hyped and under-deliver, and librarians will want to understand the tools while advising their SR users.) Systematic reviews require a transparent and exhaustive search to address specific research questions. Librarians, recognized for their expertise in search strategies and databases, are frequently asked to join research teams to formulate and implement comprehensive search strategies. The use of automation tools and technologies in literature searches is not new; however, the adoption of recent artificial intelligence (AI)-based research tools has rapidly transformed the work of systematic reviews, from forming research questions to reporting findings. Focusing on the search stage of the systematic review process, which is critical to a librarian’s role, the paper investigates the current status of AI-based search tools used and recommended for systematic reviews and broader literature reviews. The author examines the features and viability of select AI-based tools, evaluates their integration into existing systematic review workflows, and addresses issues related to transparency, reproducibility, and trustworthiness. The study also assesses whether AI tools can be incorporated into systematic review processes and discusses the evolving roles and responsibilities of librarians in using these technologies.
As artificial intelligence (AI) becomes increasingly integrated into the generation and delivery of information, new tools are emerging to assist researchers and clinicians with discovery, search, and reviews. These tools, such as Elicit, Scite, and Research Rabbit, promise to revolutionize academic research by offering superhuman speed in analyzing papers, building citation chains, and writing reviews. Despite these advancements, trust in AI remains low among researchers. To maximize the benefits of AI while mitigating risks, researchers should adopt a hybrid approach that combines AI-generated insights with traditional scholarly methods. Ongoing assessment and comparison of AI tools will be crucial in maintaining academic standards and ensuring ethical responsibility as the technology evolves.
26. Picalho AC, Oliveira GR, Cativelli AS. Artificial intelligence in bibliographic searches in scientific databases: comparing search expressions in ChatGPT, Copilot, and Gemini. RDBCI: Revista Digital de Biblioteconomia e Ciência da Informação. 2025 May 12;23:e025013. https://www.scielo.br/j/rdbci/a/GdtF5jnYMshLsy5gr8yybsj/?lang=en
"...These tools are currently valuable allies for librarians and researchers in repetitive and corrective tasks. However, the conducted tests indicate that the tools alone do not generate highly sensitive search expressions, requiring supervision and adjustments before the strategy is implemented."
27. Qureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models "the answer" to bringing us closer to systematic review automation? Syst Rev. 2023 Apr 29;12(1):72. https://pubmed.ncbi.nlm.nih.gov/37120563/
A ChatGPT-generated PubMed search strategy, used as an initial draft, would be helpful to those who lack access to an informationist (librarian). However, the ChatGPT search strategy tested was unusable, with multiple issues, including fabricated controlled vocabulary, that would not be apparent without expertise in search construction. (Dean’s note: the authors suggest ChatGPT is an acceptable first step in talking through SR searching before consulting a librarian - though not everyone has access to one.)
28. Sami AM, Rasheed Z, Kemell KK, Waseem M, Kilamo T, Saari M, Duc AN, Systä K, Abrahamsson P. System for systematic literature review using multiple AI agents: concept and an empirical evaluation. arXiv preprint arXiv:2403.08399. 2024 Mar 13. https://arxiv.org/abs/2403.08399
Proposes a model that operates through a user-friendly interface where researchers input their topic, and in response, the model generates a search string used to retrieve relevant academic papers. (Dean’s note: the search strings are not as good, in my view, as the ones generated by PubMed automatically when you use keyword search.)
29. Seth I, Lim B, Xie Y, Ross RJ, Cuomo R, Rozen WM. Artificial intelligence versus human researcher performance for systematic literature searches: a study focusing on the surgical management of base of thumb arthritis. Plastic and Aesthetic Research. 2025 Jan 6;12:N-A. https://www.oaepublish.com/articles/2347-9264.2024.99?to=comment
Dean: This is led by clinicians. The study examined AI search engines (Elicit, Consensus, ChatGPT) vs. manual searching for literature retrieval, focusing on the surgical management of trapeziometacarpal osteoarthritis. Findings highlight the advantages and drawbacks of AI search engines for literature searches. While Elicit was prone to error, Consensus and ChatGPT were less comprehensive; significant enhancements in the precision and thoroughness of AI search engines are required before they can be effectively utilized in academia. Dean: this case study examines the comparative performance of human-initiated and AI-initiated literature searches. The AI platforms showed poor proficiency, especially ChatGPT, which performed poorly across all domains and outcomes. Although Elicit came closest to mimicking the precision of the initial human search, manual searches were far superior to all AI literature search engines in terms of the number of studies identified and their specificity to the subject.
Dean: This paper is often cited as a good comparison between AI and human extraction from documents. It explores the use of generative AI tools to reliably extract qualitative data from peer-reviewed documents, evaluating the capacity of multiple AI tools to analyse literature and extract relevant information for a systematic literature review and comparing the results to those of human reviewers. The authors address how well AI tools can discern the presence of relevant contextual data, whether the outputs of AI tools are comparable to human extractions, and whether the difficulty of the question influences extraction performance. While the AI tools tested (GPT4-Turbo and Elicit) were not reliable in discerning the presence or absence of contextual data, at least one of the AI tools consistently returned responses on par with human reviewers. Highlights the role of AI tools in the extraction phase of evidence synthesis for supporting human-led reviews and underscores the ongoing need for human oversight. (Dean: this might be a good paper to model for Elicit or Undermind-based research.)
31. Sutton A, Parisi V. ChatGPT: Game-changer or wildcard for systematic searching? Health Info Libr J. 2024 Mar;41(1):1-3. doi: 10.1111/hir.12517. https://pubmed.ncbi.nlm.nih.gov/38418378/
Dean: This paper seems a bit thin in content. UK health librarians discuss taking a course about using ChatGPT in reviews. The course covered using AI for the whole systematic review process. The main uses covered for literature searching were suggesting search terms, citation chaining (apparently ChatGPT is pretty good at this), and translating a search into inclusion/exclusion criteria (although generally we would do this the other way around). The course providers likened it to having a conversation with a research assistant: the better you train it, the better it works. (Dean’s note: this paper reports on a librarian-led course for using ChatGPT.)
32. Tomassini F, Luraschi A, Patarnello S, Masciocchi C, Arcuri G, Lilli L. Leveraging ChatGPT-4 for Evidence Synthesis: A Case Study on the Use of a Large Language Model in a Systematic Review. Stud Health Technol Inform. 2025 Oct 2;332:22-26. doi: 10.3233/SHTI251488. PMID: 41041739. https://ebooks.iospress.nl/doi/10.3233/SHTI251488
ChatGPT can effectively enhance the efficiency of systematic reviews, but it must be used under expert supervision to ensure accuracy, reproducibility, and quality. AI should be viewed as a support tool rather than a replacement for human judgment. When managed properly, the synergy between LLMs and expert oversight can significantly improve the speed and accessibility of SLRs without compromising their integrity. Future work will focus on harmonizing these findings with HTA and European standards. Additionally, it will be essential to establish a rigorous framework for LLM performance evaluation and to address the persistent challenges of low reproducibility and the lack of comparative analysis with existing tools.
32. Tomczyk P, Brüggemann P, Mergner N, Petrescu M. Are AI tools better than traditional tools in literature searching? Evidence from E-commerce research. Journal of Librarianship and Information Science. 2024 Nov 15:09610006241295802. https://journals.sagepub.com/doi/abs/10.1177/09610006241295802
Dean: This is a bit long, and not always relevant. It examines the potential implications of artificial intelligence (AI) for literature search, comparing AI-based tools to conventional research methods, and poses four questions regarding accuracy, quality, uniqueness, and qualified uniqueness. Employing Algorithmic Theory and Data Dependency Theory, it scrutinizes AI performance in algorithms, machine learning models, and data quality. Testing nine e-commerce topics using Scopus, Web of Science, Elicit, and SciSpace, the authors find that while conventional methods excel in accuracy and quality, AI tools show promise in uniqueness, complementing literature reviews. The findings emphasize the integration of AI tools and advocate for research into new applications and diverse fields. Dean: this research offers insights into leveraging AI tools to enhance conventional literature search practices in research and professional domains.
AI-driven literature retrieval tools, such as Elicit, ResearchRabbit, and Consensus, claim to accelerate the initial phases of systematic searching by replacing traditional Boolean logic with natural language queries, enhancing efficiency and accessibility. By offering an intuitive and user-friendly interface, these platforms promise to automate search processes, reducing the need for manual query formulation. However, their reliance on open-access sources and pre-indexed content limits their ability to comprehensively retrieve relevant studies. Unlike traditional search engines that provide real-time access to proprietary databases such as PubMed, Cochrane, and Embase, AI-assisted retrieval tools are often constrained by paywalls and subscription-based repositories, frequently missing key studies. AI-generated Boolean queries lack a critical feature of systematic search: reproducibility. Identical prompts can yield variable outputs, even when controlling for database specifications, making it difficult to verify or replicate search results. This inconsistency undermines a foundational principle of systematic searching, where transparent and repeatable methods are essential for ensuring validity.
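One way to quantify the reproducibility problem described above is to submit the identical query several times and measure the overlap of the returned record sets. A minimal sketch, assuming a hypothetical run_search wrapper around whichever tool is being tested:

```python
# Sketch of a reproducibility check: identical prompts should return
# (near-)identical result sets. run_search is a hypothetical wrapper around
# the AI tool under test, returning a set of DOIs/PMIDs per run.
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def mean_pairwise_overlap(run_search, query: str, runs: int = 5) -> float:
    results = [run_search(query) for _ in range(runs)]
    pairs = list(combinations(results, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# A score near 1.0 means stable output; low scores are a red flag for any
# workflow that depends on transparent, repeatable search methods.
```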
Dean: This is another example of a paper that seems old. This 2022 general paper outlines ways AI can support researchers in the searching and screening tasks of literature reviews. It cites TheoryOn (Li et al., 2020) for ontology-based searches for constructs and construct relationships in behavioral theories; Litbaskets (Boell and Wang, 2019), which supports researchers in setting a manageable scope in terms of journals covered; and LitSonar (Sturm and Sunyaev, 2018), which offers syntactic translation of search queries for different databases. (Dean’s note: an early paper; the authors do their best, but it’s not clear whether the research is still relevant for most researchers.)
35. Wang S, Scells H, Koopman B, et al. Can ChatGPT write a good Boolean query for systematic review literature search? 2023. https://arxiv.org/abs/2302.03495
Dean: This paper was valuable when it came out. It investigates AI models, specifically ChatGPT, in generating Boolean queries for SR searching. ChatGPT’s ability to follow instructions and generate queries with high precision makes it a valuable tool for researchers conducting systematic reviews, particularly for rapid reviews where time is a constraint and trading off higher precision for lower recall is often acceptable. (Dean’s note: I would not recommend using ChatGPT in any situation involving time constraints, as it is too misleading. Use it if you have time for browsing, testing, and having a conversation with a robot.)
Dean: This paper seems almost ancient now. Numerous articles decry the potential of college students using AI to cheat on exam questions, discussion board posts, and research papers. Yet, through the use of large language models (LLMs), AI may be used as a tool to increase efficiency, perform repetitive tasks, and aid with research and analysis. It is through the lens of AI as a tool, not as a concern or a replacement for human involvement, that the authors examined Elicit.org, a literature search tool that uses LLMs to aid the research process.
Column examines SciSpace, Semantic Scholar, Elicit, Google Scholar, Research Rabbit, PubMed, and CAB Abstracts, providing analysis of the systems’ interfaces and outputs. A veterinary topic was selected to test the AI tools in searching: colic AND horses AND microbiome in each of the AI tools (SciSpace, Semantic Scholar, Elicit, Research Rabbit, and Google Scholar) and databases (PubMed and CAB Abstracts). The authors phrased the search as a question, “What is the relationship between colic and the microbiome in horses?”, since the AI tools were designed to answer questions rather than execute Boolean searches. They downloaded about 100 records from each tool, exported them into the reference manager EndNote, and then searched for duplicates among all the results, allowing them to calculate the number of references shared by each tool with at least one other tool versus the number of unique references.
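The shared-versus-unique analysis described above is straightforward to reproduce outside EndNote. A minimal sketch, assuming each tool's exports have already been reduced to sets of normalized identifiers (e.g., lowercased DOIs):

```python
# Sketch of the shared-vs-unique reference count described above. Assumes
# each tool's ~100 exported records have been normalized to identifiers
# (e.g., lowercased DOIs); real exports would also need title-based matching
# for records that lack DOIs.
def shared_vs_unique(results_by_tool: dict[str, set[str]]) -> None:
    for tool, records in results_by_tool.items():
        others = set().union(*(r for t, r in results_by_tool.items() if t != tool))
        print(f"{tool}: {len(records & others)} shared with at least one "
              f"other tool, {len(records - others)} unique")

# Tiny illustrative inputs (hypothetical DOIs, not the column's data):
shared_vs_unique({
    "Elicit": {"doi:10.1/a", "doi:10.1/b"},
    "PubMed": {"doi:10.1/b", "doi:10.1/c"},
    "SciSpace": {"doi:10.1/c", "doi:10.1/d"},
})
```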
Dean's note: This paper reveals many of the limitations of ChatGPT applied to literature searching. Each literature search scenario posed different challenges: an abundance of secondary information sources for high-interest topics, and uncompelling literature for new/niche topics. The authors tested practical examples highlighting both the potential and the pitfalls of integrating conversational AI into literature search processes, and underscore the necessity for rigorous comparative assessments of AI tools in scientific research.
Citation-based literature search tools (Litmaps, Connected Papers, and Research Rabbit) use the references between papers to discover new literature. For example, if 10 papers in your collection all reference the same paper that is not in your collection, these tools would suggest it. Semantics-based literature search tools (Consensus, SciSpace, Elicit) use an AI model like ChatGPT to analyze the abstracts of papers (and sometimes the full text, if open access) and find papers using a plain-text query.
Other relevant background
Alshami A, Elsayed M, Ali E, Eltoukhy AE, Zayed T. Harnessing the power of ChatGPT for automating systematic review process: Methodology, case study, limitations, and future directions. Systems. 2023 Jul;11(7):351. https://www.mdpi.com/2079-8954/11/7/351
Blum M. ChatGPT Produces Fabricated References and Falsehoods When Used for Scientific Literature Search. J Card Fail. 2023;29(9):1332-1334. https://pubmed.ncbi.nlm.nih.gov/37406729/
Cox AM. The impact of AI, machine learning, automation and robotics on the information professions: A report for CILIP. CILIP: the Library and Information Association. 2021. https://www.cilip.org.uk/page/researchreport
Hill JE, Harris C, Clegg A. Methods for using Bing's AI‐powered search engine for data extraction for a systematic review. Research Synthesis Methods. 2024 Mar;15(2):347-53. Abstract: Natural language processing artificial intelligence techniques have the potential to automate data extraction, save time and resources, accelerate the review process, and enhance the quality of extracted data. Proposes a method for using Bing AI and Microsoft Edge as a second reviewer to verify data items extracted by a single human reviewer.
MacFarlane A, Russell-Rose T, Shokraneh F. Search strategy formulation for systematic reviews: Issues, challenges and opportunities. Intelligent Systems with Applications. 2022 Sep 1;15:200091. Abstract: Methods used to construct search strategies can be complex, time consuming, resource intensive and error prone. The authors examine the state of the art in resolving complex structured information needs. They analyse the literature to identify key challenges and issues and explore appropriate solutions and workarounds. They propose a way forward to facilitate trust and to aid explainability and transparency, reproducibility and replicability through a set of key design principles for tools to support the development of search strategies in systematic literature reviews.
Nack A, Benavent D. AI in medical research: boosting discovery or weakening critical search skills? ARP Rheumatol. 2025 Apr-Jun;4(2):76-79. https://pubmed.ncbi.nlm.nih.gov/40629816/ Great editorial written by physicians.
Rather than rejecting AI, the solution may lie in integrating it critically into medical workflows. Success depends on introducing digital-scholarship competencies (algorithmic literacy, critical appraisal of machine output and data‑governance principles) early in medical training. Researchers must learn to validate AI-generated content against primary sources to ensure accuracy and avoid misinformation. Training should promote hybrid search strategies that combine Boolean logic with AI-powered tools to reduce bias and improve coverage. Equally essential is systematic instruction on algorithmic bias and model provenance, enabling future clinicians and scientists to interrogate opaque ranking systems and preserve methodological rigour in evidence synthesis. Journals and funding bodies should, in parallel, require transparent disclosure of AI assistance in literature searches, reinforcing reproducibility and sustaining accountability across the research process.
Oeding JF, Lu AZ, Mazzucco M, Fu MC, Taylor SA, Dines DM, Warren RF, Gulotta LV, Dines JS, Kunze KN. ChatGPT-4 Performs Clinical Information Retrieval Tasks Using Consistently More Trustworthy Resources Than Does Google Search for Queries Concerning the Latarjet Procedure. Arthroscopy. 2025 Mar;41(3):588-597.
Oelen A, Jaradeh MY, Auer S. ORKG ASK: A Neuro-symbolic Scholarly Search and Exploration System. arXiv preprint arXiv:2412.04977. 2024 Dec 6. Abstract: Finding scholarly articles is a time-consuming and cumbersome activity, yet crucial for conducting science. Due to the growing number of scholarly articles, new scholarly search systems are needed to effectively assist researchers in finding relevant literature. Dean: really interesting background.
Russell-Rose T. Rethinking ‘Advanced Search’: an AI-based approach to search strategy formulation. In: The Human Position in an Artificial World: Creativity, Ethics and AI in Knowledge Organization 2019 June 26 (pp. 275-290). Ergon-Verlag.
Wildgaard LE, Vils A, Sandal Johnsen S. Reflections on tests of AI-search tools in the academic search process. LIBER Quarterly: The Journal of the Association of European Research Libraries. 2023;33(1). https://liberquarterly.eu/article/view/13567
AI TOOLS & CHATBOTS
A few tools I'm testing that seem most useful in the biomedical domain:
Consensus https://consensus.app/ uses AI to distill findings from scientific research, "reading" papers and extracting key results.
Dimensions AI https://www.dimensions.ai/ provides free access to over 100 million publications and preprints to help you find papers; it shows the context - with citations, news and social media mentions, and links to funded grants and patents.
Note: Please use your critical reading skills while reading entries. No warranties, implied or actual, are granted for any health or medical search or AI information obtained while using these pages. Check with your librarian for more contextual, accurate information.