
information for practice

news, new scholarship & more from around the world



Improving the Understandability of Clinical Guidelines: Development and Evaluation of a GPT-4–Based Pipeline

Background: Difficulty in finding and understanding information in clinical guidelines contributes to medication errors. Large language models (LLMs) can simplify complex text to aid understanding, but this approach to improving the quality of guidelines has not been investigated. However, LLMs are also known to hallucinate, generating outputs that may not align with reality.

Objective: This study aimed to develop and evaluate an LLM pipeline to improve the readability of clinical guidelines while ensuring the preservation of critical content.

Methods: To align LLM revisions with research evidence and enable comparison with manual editing, the National Health Service Injectable Medicines Guide (IMG) was used as a case study. A GPT-4–based pipeline was applied to it, with prompts based on user testing–derived recommendations for IMG authors. This enabled readability comparisons between IMG guideline versions: original; manually revised or GPT-4–revised using the user testing–derived recommendations; and fully user tested. Readability was evaluated using readability metrics and ratings from 3 expert pharmacists. Content similarity before and after LLM revision was assessed using BERT (bidirectional encoder representations from transformers) scores and expert pharmacist review.

Results: Across 20 IMG guidelines used in practice, BERT scores indicated high semantic similarity between the original and LLM-revised guidelines (0.88-0.96). An omission, addition, or change in meaning was identified by at least one pharmacist in 30 (20%), 7 (5%), and 18 (12%) of the 153 guideline subsections, respectively. The SMOG (Simple Measure of Gobbledygook) grade showed a small but significant improvement in readability for the LLM-revised guidelines (mean difference 0.32, 95% CI 0.10-0.55; P=.02) and the manually revised versions (mean difference 0.46, 95% CI 0.13-0.79; P=.03). There was no significant difference between the LLM-revised and manually revised versions (P>.99), and no significant difference in Flesch-Kincaid reading grades (P=.91). Expert ratings favored the LLM-revised versions for understandability. For 2 IMG guidelines from previous research, user testing produced a greater improvement in readability than LLM revision.

Conclusions: Authors should not use current LLMs to modify clinical guidelines without carefully checking the revised text for unintended omissions, additions, or changes in meaning. Further work should investigate the potential of LLMs to augment manual user testing and reduce the barriers to wider use of this approach to improve the safety of clinical guidelines.
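
The abstract names the evaluation components (SMOG and Flesch-Kincaid readability grades, BERT-based semantic similarity) without reproducing the pipeline's prompts or code. The minimal Python sketch below shows how such a check might be run on one guideline subsection, assuming the textstat and bert-score packages; the guideline text and all names are illustrative and are not taken from the paper.

    # Hypothetical check of one guideline subsection against its LLM-revised version,
    # mirroring the evaluation described in the abstract: readability grades plus
    # BERTScore semantic similarity. Assumes the textstat and bert-score packages.
    import textstat                 # readability metrics (SMOG, Flesch-Kincaid)
    from bert_score import score    # semantic similarity from BERT embeddings

    original = ("Reconstitute the vial with 10 mL of water for injection and agitate "
                "until the lyophilised powder is completely dissolved prior to dilution.")
    revised = ("Add 10 mL of water for injection to the vial. Shake it gently until all "
               "the powder has dissolved. Then dilute the solution.")

    # Readability: a lower grade suggests easier reading.
    for label, text in [("original", original), ("LLM-revised", revised)]:
        print(f"{label}: SMOG={textstat.smog_index(text):.2f}, "
              f"Flesch-Kincaid grade={textstat.flesch_kincaid_grade(text):.2f}")

    # Content preservation: an F1 near 1.0 suggests the meaning was retained
    # (the study reported BERT scores of 0.88-0.96 across 20 guidelines).
    precision, recall, f1 = score([revised], [original], lang="en")
    print(f"BERTScore F1: {f1.item():.3f}")

As the study found, automated similarity scores do not catch everything: expert pharmacists still flagged omissions, additions, and changes in meaning, which is why the conclusion stresses manual checking of any LLM-revised guideline text.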

Read the full article ›

Posted in: Guidelines Plus on 03/16/2026 | Link to this post on IFP |


© 1993-2026 Dr. Gary Holden. All rights reserved.

gary.holden@nyu.edu
@Info4Practice