
Chatbot Assessment Reporting Tool (CHART)

Chatbot Assessment Reporting Tool (CHART), 2025: https://chartguideline.org/


Introduction

Health professionals increasingly seek health information from large language models (LLMs) via chatbots, yet the nature of these tools, along with their inherent risks and benefits, remains inadequately evaluated. Reflecting this, 2025 has seen a sharp rise in MEDLINE-indexed studies involving chatbots.

  • In 2025, the Chatbot Assessment Reporting Tool (CHART) was published. CHART is a reporting guideline and structured checklist designed for studies evaluating generative-AI chatbots when used to summarize clinical evidence or to provide health advice.
  • CHART applies to studies that evaluate a chatbot's ability to provide medical guidance or health-related recommendations; it guides how such a study should be reported in a biomedical journal so that readers (clinicians, policymakers, and other researchers) can understand which chatbot was tested and what methods were used.
  • CHART consists of 12 major items and 39 sub-items intended to "advance detailed, transparent reporting of chatbots".

What CHART Asks You To Report

  • AI model identification — name, version, release date, open-source or proprietary?
  • Model details — base model, fine-tuned, or otherwise modified?
  • Prompt engineering — how prompts/questions for chatbots were chosen or developed; whether there was patient/public involvement.
  • Query strategy — how queries were issued (date, location, interface, etc.).
  • Performance evaluation — what “ground truth” or reference standard was used (e.g. clinical guideline, expert judgement) and how performance was assessed (quantitative metrics, reproducibility, etc.).
  • Sample size & data analysis — how many queries/responses were evaluated, how the analysis was done, and statistical methods if applicable.
  • Results & Discussion — clear presentation of alignment (or misalignment) between chatbot responses and reference standard, limitations, potential biases.
  • Transparency, ethics & open science: disclosures (conflicts of interest, funding), whether ethics review was needed/obtained, availability of code and data, and the study protocol.
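
The checklist is a reporting aid rather than software, but the items above can be thought of as a structured record that accompanies a study. The sketch below (Python; the field names and the `proportion_matching` helper are illustrative inventions, not an official CHART schema) shows one way a study team might keep these details together alongside a simple agreement metric against the reference standard.

```python
from dataclasses import dataclass

# Hypothetical record of the reporting items CHART asks for.
# Field names are illustrative only and are not an official CHART schema.
@dataclass
class ChartStudyReport:
    model_name: str            # AI model identification (name, version, release date)
    model_version: str
    release_date: str
    open_source: bool
    base_model: str            # Model details: base model, fine-tuned or modified?
    fine_tuned: bool
    prompt_strategy: str       # How prompts/questions were developed
    patient_involvement: bool  # Patient/public involvement in prompt design
    query_details: str         # Date, location, interface used for queries
    reference_standard: str    # "Ground truth" used for performance evaluation
    n_queries: int             # Sample size
    analysis_methods: str      # How the analysis was done
    ethics_review: str         # Ethics / transparency / open-science disclosures
    funding_disclosure: str
    data_availability: str

def proportion_matching(judgements: list[bool]) -> float:
    """Share of chatbot responses judged consistent with the reference standard."""
    return sum(judgements) / len(judgements) if judgements else 0.0

# Entirely made-up example study record and result.
report = ChartStudyReport(
    model_name="ExampleChat", model_version="1.0", release_date="2025-01-01",
    open_source=False, base_model="ExampleBase", fine_tuned=False,
    prompt_strategy="Clinician-drafted questions piloted with patient partners",
    patient_involvement=True,
    query_details="Queried March 2025 via the public web interface, default settings",
    reference_standard="National clinical guideline recommendations",
    n_queries=100,
    analysis_methods="Descriptive statistics; two independent raters",
    ethics_review="Not required (no human participants)",
    funding_disclosure="No external funding",
    data_availability="Prompts and responses deposited in a public repository",
)
print(f"{report.model_name}: {proportion_matching([True] * 83 + [False] * 17):.0%} agreement")
```

In practice these items are addressed in the manuscript text and supplementary material; the record above only illustrates how many distinct details CHART expects authors to report explicitly.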

Why Was CHART Created (Purpose & Importance)

As generative-AI chatbots (built on large language models) have come to be used for health advice and for summarizing medical literature, early studies varied widely in how they were conducted and reported. Without a reporting guideline, it was hard to compare results, assess their validity, or replicate them. CHART aims to standardize reporting so that:

  • readers (clinicians, researchers, policymakers) can accurately understand what was done and how;
  • differences between studies (models, methods, prompt design, evaluation criteria) become visible; and
  • the research community fosters transparency, reproducibility, and trust, which is especially important in health contexts, where wrong or misleading advice could have serious consequences.

The guideline is registered with the EQUATOR Network, which promotes reporting standards for health research.

Who Developed CHART & How

CHART was developed by a team of international, multidisciplinary stakeholders — researchers, clinicians, methodologists — via a multi-phase process: first a comprehensive systematic review; then a modified asynchronous Delphi consensus process with 531 stakeholders; then three synchronous expert-panel meetings with 48 participants; and finally pilot testing. The result is a consensus-driven, evidence-based guideline intended to reflect broad expert input.

What CHART Is and Is Not

CHART is a reporting guideline — i.e. it tells authors how to report their study so that it’s transparent and interpretable. It is designed for a specific domain: Chatbot Health Advice (CHA) studies, i.e. generative-AI chatbots summarizing clinical evidence or giving health advice.

CHART is not a “quality-scoring tool” or “critical appraisal checklist”: it does not tell you whether a chatbot’s performance was good or bad. It focuses on the quality and completeness of reporting, not on defining performance thresholds or evaluating clinical validity. Nor is it a guide for every possible setting where generative AI might be used in medicine: it was developed primarily for “single-session chatbot health advice studies,” not necessarily for longitudinal clinical trials or other designs.

When to Use CHART

  • For researchers/authors: If you plan to publish a study evaluating a health-advice chatbot, using CHART helps ensure your work is credible, transparent, and usable by others.
  • For journal editors & peer reviewers: CHART gives a shared standard to assess whether chatbot-health-advice studies are reported thoroughly.
  • For clinicians / readers / policymakers / patients: Comprehensive reporting helps you interpret results responsibly — understand limitations, compare across studies, avoid overconfidence in AI-based health advice.

As use of generative AI grows in healthcare, CHART will help to set a baseline of methodological rigour and accountability, which is especially important given the risks of misinformation or unvalidated advice.

Related Chatbot Research & Initiatives

  • MAST, a suite of realistic clinical benchmarks to evaluate real-world performance of medical AI systems — https://bench.arise-ai.org/

Disclaimer

  • Note: Please use your critical reading skills while reading entries. No warranties, implied or actual, are granted for any health or medical search or AI information obtained while using these pages. Check with your librarian for more contextual, accurate information.