Decision, Vol 12(1), Jan 2025, 43-62; doi:10.1037/dec0000230
In applications involving human forecasters, proper scoring rules such as the Brier score or the logarithmic score are the gold standard for forecaster evaluation. They provide incentives for forecaster honesty; they are simple to compute; they are well studied; and they provide a mechanism for forecaster selection and aggregation. These scoring rules are somewhat inflexible, however, and cannot directly handle various attributes of forecasting data, including missing forecasts and nested forecasts. In light of these issues, we consider the scoring of forecasters via statistical models that are grounded in the logarithmic scoring rule. We specifically discuss the connection between model log likelihoods and the logarithmic scoring rule, which links the statistics literature to the literature on human forecaster evaluation. We use publicly available data from the Good Judgment Project to illustrate the flexibility of model-based evaluations and to compare these evaluations to traditional approaches. We also consider implications for forecaster selection, recalibration, and aggregation.
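To make the two scoring rules and the log-likelihood connection concrete, the following is a minimal Python sketch, not the authors' models: the forecast probabilities and outcomes are hypothetical, and the binary-outcome setting is assumed for simplicity. It shows that the summed logarithmic score of a forecaster is exactly the log likelihood of a Bernoulli model whose predicted probabilities are the forecaster's stated probabilities, which is the connection between model log likelihoods and the logarithmic scoring rule described above.

```python
import numpy as np

def brier_score(p, y):
    """Brier score for binary outcomes: mean squared error of the
    forecast probability p against the 0/1 outcome y (lower is better)."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return np.mean((p - y) ** 2)

def log_score(p, y, eps=1e-12):
    """Logarithmic score: mean log probability assigned to the realized
    outcome (higher is better). eps guards against log(0)."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    p = np.clip(p, eps, 1 - eps)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data: four probability forecasts and realized outcomes.
p = np.array([0.9, 0.4, 0.7, 0.2])
y = np.array([1, 0, 1, 1])

print(brier_score(p, y))         # 0.225
# Summing (rather than averaging) the log scores yields the Bernoulli
# log likelihood of the forecasts under the realized outcomes (~ -2.58),
# which is why model log likelihoods can serve as forecaster evaluations.
print(len(y) * log_score(p, y))
```

Because the model-based evaluation is just a likelihood, standard statistical machinery (e.g., mixed models that borrow strength across forecasters, or likelihood methods that accommodate missing observations) can then be brought to bear where raw per-forecast scores cannot.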