To assess the feasibility of applying machine-learning (ML) methods to imputation in the Medical Expenditure Panel Survey (MEPS).
All data come from the 2016-2017 MEPS.
Currently, expenditures for medical encounters in the MEPS are imputed with a predictive mean matching (PMM) algorithm, in which a linear regression model predicts expenditures for events with expenditure data (donors) and events without (recipients). Each recipient event is then matched to the donor event with the closest predicted expenditure, and the donor event’s expenditures are used as the recipient event’s imputation. We replace the linear regression algorithm in the PMM framework with ML methods to predict expenditures. We examine five alternatives to linear regression: Gradient Boosting, Random Forests, Extreme Random Forests, Deep Neural Networks, and a Stacked Ensemble approach. Additionally, we introduce an alternative matching scheme that matches on a vector of predicted expenditures by source of payment, rather than on a single total expenditure prediction, to generate potentially superior matches.
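The matching step described above can be illustrated with a minimal sketch. It is not the MEPS production procedure: the function names (`fit_ols`, `predict_ols`, `pmm_impute`), the synthetic data, and the use of Euclidean distance for vector matching are all assumptions made for illustration. Any predictive model (for example, a gradient boosting or ensemble model) could supply the predictions in place of the OLS fit shown here.

```python
import numpy as np

def fit_ols(X, y):
    """Fit ordinary least squares with an intercept (hypothetical helper)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict_ols(beta, X):
    """Predict from coefficients returned by fit_ols."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ beta

def pmm_impute(pred_donor, pred_recipient, donor_values):
    """Give each recipient the observed value(s) of its nearest donor.

    Predictions may be 1-D (total expenditure) or 2-D (one column per
    source of payment). Distance is Euclidean, which is an assumption;
    the source does not specify the metric.
    """
    p_don = np.asarray(pred_donor, dtype=float).reshape(len(pred_donor), -1)
    p_rec = np.asarray(pred_recipient, dtype=float).reshape(len(pred_recipient), -1)
    # Pairwise distances: rows are recipients, columns are donors.
    dist = np.linalg.norm(p_rec[:, None, :] - p_don[None, :, :], axis=2)
    idx = dist.argmin(axis=1)
    return np.asarray(donor_values)[idx], idx

# Synthetic example: donors have observed expenditures, recipients do not.
rng = np.random.default_rng(0)
X_donor = rng.normal(size=(50, 3))
y_donor = 200 + X_donor @ np.array([50.0, -20.0, 10.0]) + rng.normal(scale=5, size=50)
X_recip = rng.normal(size=(5, 3))

beta = fit_ols(X_donor, y_donor)
imputed, donor_idx = pmm_impute(
    predict_ols(beta, X_donor), predict_ols(beta, X_recip), y_donor
)

# Vector matching: two donors with the same predicted total (100) but
# different payment mixes, e.g. columns = [private insurance, out of pocket].
donor_vec = np.array([[100.0, 0.0], [0.0, 100.0]])
recip_vec = np.array([[10.0, 90.0]])
_, idx_total = pmm_impute(donor_vec.sum(axis=1), recip_vec.sum(axis=1), [0, 1])
_, idx_vector = pmm_impute(donor_vec, recip_vec, [0, 1])
```

In the last example, matching on the total alone cannot distinguish the two donors, whereas matching on the vector selects the donor whose predicted payment mix resembles the recipient's. This is the intuition behind matching on predicted expenditures by source of payment.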
Study data are derived from a large federal survey.
ML algorithms perform better at both prediction and matching imputation than Ordinary Least Squares (OLS), the most common prediction algorithm used in PMM. On average, the Stacked Ensemble approach that combines all the ML algorithms performs best, improving expenditure prediction R² by 108% (0.156 points) and final imputation R² by 227% (0.397 points). Matching on a prediction vector also improves alignment of sources of payment between donor and recipient events.
Machine learning algorithms and an alternative matching scheme improve the overall quality of expenditure PMM imputation in the MEPS. These methods may have additional value in other national surveys that currently rely on PMM or similar methods for imputation.