
information for practice

news, new scholarship & more from around the world


  • gary.holden@nyu.edu
  • @Info4Practice

Evaluating Rater Effects of Large Language Models in Automated Essay Scoring: GPT, Claude, Gemini, and DeepSeek

Abstract

Large language models (LLMs) have been widely explored for automated scoring in educational assessment to facilitate learning and instruction. However, empirical evidence regarding which LLMs produce the most reliable scores and induce the fewest rater effects remains limited. This study compared 10 LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. Their performance was evaluated in terms of score accuracy, intra-rater consistency, and rater effects estimated using the Many-Facet Rasch model. The results generally supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, which showed high scoring accuracy, better intra-rater consistency, and fewer rater effects. However, given the small sample size, the study is not intended to support substantive comparisons or rankings of LLMs or to identify a single “best” model.
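
For readers unfamiliar with the Many-Facet Rasch model named in the abstract, the following is a minimal sketch of how a rater-severity term enters a rating-scale formulation of that model. It is not the study's analysis code: the function names, parameter values, and the choice of a rating-scale (shared-threshold) form are all assumptions made for illustration.

```python
import math


def mfrm_category_probs(ability, task_difficulty, rater_severity, thresholds):
    """Return P(score = k) for k = 0..len(thresholds).

    Assumed rating-scale form of the Many-Facet Rasch model:
    log(P_k / P_{k-1}) = ability - task_difficulty - rater_severity - thresholds[k-1]
    All quantities are on the logit scale. Names are hypothetical.
    """
    # Cumulative sums of the step logits; category 0 has an empty sum (0.0).
    cumulative = [0.0]
    for tau in thresholds:
        step_logit = ability - task_difficulty - rater_severity - tau
        cumulative.append(cumulative[-1] + step_logit)

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probs):
    """Expected rating implied by a list of category probabilities."""
    return sum(k * p for k, p in enumerate(probs))


# Illustrative values only: one essay writer, one task, a 0-3 rating scale.
thresholds = [-1.0, 0.0, 1.0]
lenient = mfrm_category_probs(0.5, 0.0, -0.5, thresholds)
severe = mfrm_category_probs(0.5, 0.0, 0.5, thresholds)

print("lenient rater expected score:", round(expected_score(lenient), 3))
print("severe rater expected score:", round(expected_score(severe), 3))
```

In this sketch a more severe rater (larger `rater_severity`) shifts probability toward lower score categories for the same essay, which is the kind of rater effect the model is used to estimate.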


Posted in: Journal Article Abstracts on 04/26/2026


© 1993-2026 Dr. Gary Holden. All rights reserved.
