Note: OpenAI leads the general AI space, but other AI companies are also developing deep research tools and experimenting with AI-powered academic search in support of research. Perhaps you have faculty or students asking you to present these tools to classes.
How will AI tools affect our traditional bibliographic databases? Will we see GenAI being put into our search platforms? Can we stop it?
Also: This open textbook (or wiki channel) is intended to help librarians and other information professionals learn about AI. It is not, in itself, meant to be seen as promotion of AI. If anything, the goal is harm mitigation or harm reduction.
Introduction
"...The idea that we should outsource academic authorship to LLMs rests on the assumption that writing is (only) a mechanical, predictable or reductive process which, with the right prompts, can be replicated with ease." — Masters, 2025.
Large language models (LLMs) are artificial intelligence (AI) programs that aim to understand, generate, and predict human language. LLMs are trained on massive datasets of text and code, much of it copyrighted, and are seemingly able to carry out various natural language processing tasks, including text generation, translation, and summarization. LLMs are based on deep learning models, particularly the transformer architecture, which aims to process and understand the relationships between words and phrases in a sequence.
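To make this concrete, the short Python sketch below generates text from a prompt with a small open model. This is a minimal sketch, assuming the Hugging Face transformers package is installed; gpt2 is used purely because it is small and freely downloadable, so its output will be far cruder than that of the commercial chatbots discussed below.

```python
# A minimal text-generation sketch using the Hugging Face "transformers"
# library. "gpt2" is an illustrative choice of small open model, not a
# recommendation; commercial chatbots use far larger proprietary models.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator(
    "Large language models can help librarians by",
    max_new_tokens=30,       # generate up to 30 additional tokens
    num_return_sequences=1,
)
print(result[0]["generated_text"])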
The largest LLMs are generative pretrained transformers (GPTs), used in chatbots such as ChatGPT by OpenAI, Gemini (Google), and Claude (Anthropic); see List of LLMs. LLMs are fine-tuned for specific tasks or guided by prompt engineering. LLMs acquire predictive power regarding the syntax, semantics, and ontologies inherent in human language data, but they also inherit the inaccuracies, misinformation, and biases present in the data they are trained on.
A recent study found that participants developed "shallower knowledge" from LLM summaries, even when real-time web links augmented the results (https://academic.oup.com/pnasnexus/article/4/10/pgaf316/8303888). In other words, when individuals learn about a topic from LLM syntheses, they risk developing shallower knowledge than when they learn through standard web search, even when the core facts are the same. This shallower knowledge stems from an inherent feature of LLMs: by presenting results as summaries of vast arrays of information rather than as individual search links, they inhibit users from actively discovering and synthesizing information sources themselves.
ChatGPT 3.5: ChatGPT by OpenAI is an AI language model designed to process and generate human-like responses, assisting with diverse academic tasks and communication. Best used for: creating human-sounding answers to questions about library work.
Stable Diffusion (Stability AI): Stable Diffusion is a deep learning text-to-image model released in 2022, based on diffusion techniques. It is used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and text-prompt-guided image-to-image translation. Best used for: creating images from text prompts.
Llama Chat: Llama is an open-source large language model created by Meta (Facebook's parent company). Best used for: an open-source alternative to ChatGPT.
Examples
GPT models, such as ChatGPT by OpenAI, are trained using self-supervised learning (a type of machine learning) to predict the next word in a sequence from massive text datasets (e.g., Wikipedia, books, the web, news sources). GPT-4.5 and Gemini 2.0 Flash (released February 2025) process text, images, and code, using machine learning techniques to integrate these data types. GPT-4.5 uses transfer learning to adapt pre-trained text models for image-based tasks such as generating alt text from photos.
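The sketch below illustrates the next-word objective just described: given a prompt, a causal (GPT-style) model assigns a score to every token in its vocabulary as a possible continuation. It is a sketch under stated assumptions (the Hugging Face transformers and torch packages; gpt2 standing in for much larger proprietary models), not a description of any vendor's internal pipeline.

```python
# A sketch of the "predict the next word" training objective: score every
# vocabulary token as a possible continuation of a prompt.
# Assumes the "transformers" and "torch" packages; "gpt2" is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The library opens at", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # one score per vocabulary token, per position

top = torch.topk(logits[0, -1], k=5)  # five most likely next tokens
for token_id, score in zip(top.indices, top.values):
    print(repr(tokenizer.decode([int(token_id)])), round(float(score), 2))
```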
Transfer learning is the process by which LLMs apply prior knowledge to new tasks. It is a machine learning technique in which knowledge gained on one task or dataset is used to improve model performance on another, related task or a different dataset. In other words, transfer learning uses what has been learned in one setting to improve generalization in another.
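As a rough illustration of transfer learning, the sketch below reuses a pretrained model body and trains only a new, randomly initialised classification head on a toy labelled dataset. The model name, the two example sentences, and the labels are illustrative assumptions, not a recommended setup.

```python
# A minimal transfer-learning sketch: reuse a pretrained language model and
# fine-tune only a new classification head. All names here (the model, the
# toy examples, the labels) are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # new head, randomly initialised
)

# Freeze the pretrained body so only the new head is updated (one common
# transfer-learning strategy; full fine-tuning is another).
for param in model.distilbert.parameters():
    param.requires_grad = False

texts = ["Renew my library card", "What is the boiling point of water?"]
labels = torch.tensor([0, 1])  # 0 = library task, 1 = general reference

batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
outputs = model(**batch, labels=labels)  # one illustrative training step
outputs.loss.backward()
optimizer.step()
print("training loss:", float(outputs.loss))
```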
Presentation
Key Building Blocks of Large Language Models (LLMs)
Collect vast amounts of data (training data)
Transform data into tokens (parts of words) → trained on billions of tokens (see the tokenization sketch after this list)
Unsupervised learning: training the model (weeks to months). It creates a huge mathematical representation of human language.
Fine-tuning: Expose it to new tasks and give it more guidelines
Reinforcement learning with human feedback (RLHF): Use human reviewers to rate responses to improve quality. The model learns from those ratings.
Deploy to the public: GPT-5 (August 2025) → “a research preview.” Learn from more people using it so it can be improved.
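To illustrate the tokenization step above, the sketch below splits text into sub-word tokens using OpenAI's open-source tiktoken library; cl100k_base is the encoding used by several recent GPT models.

```python
# A tiny tokenization sketch using OpenAI's open-source "tiktoken" library.
# "cl100k_base" is the encoding used by several recent GPT models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Librarianship is interdisciplinary")
print(tokens)                              # integer token IDs
print([enc.decode([t]) for t in tokens])   # the text split into sub-word pieces
```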
Ethical issues
When using any of the AI tools from Silicon Valley, it is important to be aware of ethical issues associated with them:
Large language models (LLMs) blur the lines between truth and falsehood, and between evidence and propaganda, especially where evidence is weak or contested, or when information is scarce. In biomedicine, AI tools will likely be used to improve the efficiency of workflows, and potentially in support of political agendas.
Don't assume good intentions. LLMs can be prompted, or may act on their own, to manipulate tasks, hide their true intentions, or otherwise "scheme" when doing so serves their primary goals.
AI tools scrape the work of artists and writers in developing their training sets, and as a result produce derivative works of copyrighted materials.
AI tools are susceptible to bias and can often perpetuate biases that are present in their training data.
AI tools may store and share data; it is essential not to enter any private information into an AI tool.
AIs are often black boxes, which means it is not transparent where their information is coming from.
AI tools can be used to replace human labour, for example in the creation of AI-generated artwork. On labour exploitation and professional displacement: in 2023, Nature Biomedical Engineering wrote that "it is no longer possible to accurately distinguish" human-written text from text created by large language models, and that "...it is all but certain that general-purpose large language models will rapidly proliferate... It is a rather safe bet that they will change many industries over time." Goldman Sachs has suggested that generative language AI could increase global GDP by 7% over the next ten years while exposing 300 million jobs globally to automation.
Brinkmann et al. (2023) argue that humans exhibit variation in their ethical expectations of machines, both within and across cultures. This raises questions about how best to aggregate diverse, potentially conflicting preferences to arrive at an agreeable outcome. Recommender algorithms are altering social learning dynamics, and chatbots are forming a new mode of cultural transmission, serving as cultural models.
LLM impact on teaching and learning: a study by Kejingyun et al. (2025) examined the potential of LLMs as effective tools for improving nursing students' critical thinking and promoting nursing education.
Bottom line: A known challenge with LLMs is their tendency to generate incorrect or fabricated information ("hallucinations") (Maynez et al. 2020). Retrieval-augmented generation (RAG) is a promising strategy to mitigate this problem: it supplements an LLM's parametric (trained) memory with externally retrieved documents, providing a way to generate better-grounded responses (Gao et al. 2023; Lewis et al. 2020). By combining parametric (trained) and non-parametric (retrieved) sources, RAG reduces the frequency of hallucinations and improves interpretability, especially for information-sensitive tasks such as question answering and so-called "multi-hop reasoning" (Pham and Vo 2024; Liu et al. 2025). For health sciences librarians, LLMs might eventually support work with health professionals, but their underlying processes raise concerns for anyone interested in scientific accuracy, transparency, and rigour. Librarians value both searching for sources and searching for answers; LLMs provide some of the second function while hiding the first. Transparency, in short, is not a strong suit of LLMs.
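As a minimal, vendor-neutral illustration of the RAG pattern, the sketch below retrieves the most relevant passages from a tiny corpus and assembles them into a grounded prompt. The corpus, the keyword-overlap scorer, and the prompt wording are all illustrative assumptions; a production system would use embeddings, a vector index, and an actual LLM call.

```python
# A minimal, library-agnostic RAG sketch: retrieve the most relevant passages
# first, then ask the model to answer *only* from them. The tiny corpus and
# the scoring function are illustrative assumptions, not a vendor's API.
from collections import Counter

documents = [
    "Interlibrary loan lets patrons borrow items from partner libraries.",
    "RAG supplements a model's trained memory with retrieved documents.",
    "Course reserves are materials set aside for a specific class.",
]

def score(query: str, doc: str) -> int:
    """Crude keyword-overlap relevance score (a real system would use
    embeddings and a vector index instead)."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k highest-scoring documents for the query."""
    return sorted(documents, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Assemble a prompt that grounds the model in the retrieved context."""
    context = "\n".join(retrieve(query))
    return (
        "Answer using ONLY the context below, and cite it.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

print(build_prompt("How does RAG reduce hallucinations?"))
# The assembled prompt would then be sent to an LLM of your choice.
```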
Information provided to you on this page is changing, so check for current information (or discuss with a librarian). The entry is intended to help librarians and other information professionals learn about AI but is not, in itself, meant to be seen as promotion of AI.
References
Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y, Dai Y, Sun J, Wang H, Wang H. Retrieval-augmented generation for large language models: a survey. arXiv preprint arXiv:2312.10997. 2023.
Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, Küttler H, Lewis M, Yih WT, Rocktäschel T, Riedel S. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems. 2020;33:9459-74.
Pham DK, Vo BQ. Towards reliable medical question answering: techniques and challenges in mitigating hallucinations in language models. arXiv preprint arXiv:2408.13808. 2024.
Note: Please use your critical reading skills while reading entries. No warranties, implied or actual, are granted for any health or medical search or AI information obtained while using these pages. Check with your librarian for more contextual, accurate information.