Abstract
Background and Objectives
AI chatbots are increasingly used in patient education. For opioid use disorder (OUD), educational content must be both readable and non-stigmatizing. We compared ChatGPT responses with frequently asked question (FAQ) answers published by U.S. health organizations on readability, linguistic complexity, and stigmatizing language.
Methods
We analyzed 50 OUD FAQ answers paired with responses generated by ChatGPT (GPT-4o). Outcomes included word and sentence counts, lexical density, six readability indices, and the frequency of stigmatizing terms. Paired differences were tested with paired t tests or Wilcoxon signed-rank tests, as appropriate.
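A minimal sketch of this analysis pipeline, assuming the matched answers are available as Python strings: the textstat package implements the six readability indices named above, and SciPy provides the paired tests. Function and variable names below are illustrative, not taken from the study's code.

import textstat
from scipy import stats

def readability_profile(text):
    # The six readability indices reported in the abstract.
    return {
        "coleman_liau": textstat.coleman_liau_index(text),
        "gunning_fog": textstat.gunning_fog(text),
        "smog": textstat.smog_index(text),
        "flesch_kincaid": textstat.flesch_kincaid_grade(text),
        "ari": textstat.automated_readability_index(text),
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
    }

def compare_paired(faq_scores, chatgpt_scores):
    # Paired t test and Wilcoxon signed-rank test on matched answers.
    t_stat, t_p = stats.ttest_rel(chatgpt_scores, faq_scores)
    w_stat, w_p = stats.wilcoxon(chatgpt_scores, faq_scores)
    return {"t_test": (t_stat, t_p), "wilcoxon": (w_stat, w_p)}

# Example usage: compare Flesch-Kincaid grade across the 50 matched pairs,
# where faq_texts and chatgpt_texts are assumed lists of 50 answer strings.
# fk_faq = [readability_profile(t)["flesch_kincaid"] for t in faq_texts]
# fk_gpt = [readability_profile(t)["flesch_kincaid"] for t in chatgpt_texts]
# print(compare_paired(fk_faq, fk_gpt))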
Results
ChatGPT responses were longer than FAQ answers, with a mean word count of 253.7 vs. 76.6 (difference, 177; 95% CI, 151–203) and a mean sentence count of 18.2 vs. 9.0 (difference, 9.2; 95% CI, 7.6–10.9). Lexical density was higher by 6.5 percentage points (95% CI, 4.0–9.0), with more characters per word (+0.55; 95% CI, 0.40–0.70) and more syllables per word (+0.19; IQR, 0.14–0.24). Readability grade levels were consistently higher: Coleman–Liau +3.43, Gunning Fog +3.47, SMOG +2.96, Flesch–Kincaid +3.61, and Automated Readability Index +4.33, with Flesch Reading Ease lower by 20.4 points (all p < .05). The frequency of stigmatizing terms did not differ significantly (0.98 vs. 0.28 per answer; 95% CI for the difference, −1.3 to 3.3).
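For context, two of the indices reported above are defined by the standard published formulas, which make explicit how longer sentences and more syllables per word raise the estimated grade level (the study presumably used library implementations rather than these expressions directly):

\text{FK grade} = 0.39\left(\frac{\text{words}}{\text{sentences}}\right) + 11.8\left(\frac{\text{syllables}}{\text{words}}\right) - 15.59

\text{Flesch Reading Ease} = 206.835 - 1.015\left(\frac{\text{words}}{\text{sentences}}\right) - 84.6\left(\frac{\text{syllables}}{\text{words}}\right)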
Discussion
ChatGPT responses were longer and written at a higher reading level than the FAQ answers, whereas the frequency of stigmatizing language did not differ significantly.
Conclusions
ChatGPT produced more comprehensive but less readable content than FAQs, revealing a gap relative to health literacy standards. While stigmatizing terms were uncommon in both sources, the length and complexity of ChatGPT responses may hinder their use unless the output is simplified.
Scientific Significance
These findings quantify the readability and stigma trade-offs in AI-generated OUD patient education and underscore the need for plain-language prompting and human review.