
information for practice

news, new scholarship & more from around the world


Adapting Nearest Neighbor for Multiple Imputation: Advantages, Challenges, and Drawbacks

Abstract

The U.S. Census Bureau has historically used nearest neighbor (NN) or random hot deck (RHD) imputation to handle missing data for many types of survey data. Using these methods removes the need to parametrically model values in imputation models. With strong auxiliary information, NN imputation is preferred because it produces more precise estimates than RHD. In addition, NN imputation is robust against a misspecified response mechanism if missingness depends on the auxiliary variable, in contrast to RHD, which ignores the auxiliary information. A compromise between these two methods is k-NN imputation, which identifies a set of the k closest neighbors (“donor pool”) and randomly selects a single donor from this set. Recently these methods have been used for multiple imputation (MI), enabling variance estimation via Rubin’s combining rules. The Approximate Bayesian Bootstrap (ABB) is a simple-to-implement algorithm that makes the RHD “proper” for MI. In principle, ABB should work to propagate uncertainty for NN MI; bootstrapping respondents means each nonrespondent’s one “nearest” donor will not be available for every imputation. However, we demonstrate through simulation that NN MI using ABB leads to variance underestimation. This underestimation is somewhat but not entirely attenuated with k-NN imputation. An alternative approach to variance estimation after MI, bootstrapped MI, eliminates the underestimation with NN imputation, but we show that it suffers from overestimation of variance with nonnegligible sampling fractions under both equal and unequal probability sampling designs. We propose a modification to bootstrapped MI to account for nonnegligible sampling fractions. We compare the performance of RHD and the various NN MI methods under a variety of sampling designs, sampling fractions, distribution shapes, and missingness mechanisms.
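To make the mechanics concrete, here is a minimal sketch of k-NN multiple imputation with the Approximate Bayesian Bootstrap, following the description in the abstract. The function name, variable names, and single-auxiliary-variable setup are illustrative assumptions, not the authors' implementation; the ABB step resamples the respondents with replacement before each imputation so that a nonrespondent's nearest available donor varies across the m completed datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_abb_impute(x, y, m=5, k=5):
    """Hedged sketch of k-NN multiple imputation with ABB.

    x : auxiliary variable, fully observed (1-D array)
    y : study variable, with np.nan marking nonrespondents
    m : number of multiple imputations
    k : donor-pool size (k=1 reduces to plain NN imputation)
    Returns a list of m completed copies of y.
    """
    miss = np.isnan(y)
    resp_idx = np.flatnonzero(~miss)          # respondent positions
    imputations = []
    for _ in range(m):
        # ABB step: bootstrap the respondents with replacement, so the
        # effective donor set differs from one imputation to the next.
        boot = rng.choice(resp_idx, size=resp_idx.size, replace=True)
        y_imp = y.copy()
        for i in np.flatnonzero(miss):
            # "Nearest" is defined by distance on the auxiliary variable.
            d = np.abs(x[boot] - x[i])
            pool = boot[np.argsort(d)[:k]]     # k closest bootstrap donors
            y_imp[i] = y[rng.choice(pool)]     # random draw from the pool
        imputations.append(y_imp)
    return imputations
```

After imputation, the m completed-data estimates would be combined with Rubin's rules (the average of the point estimates, with total variance equal to the within-imputation variance plus an inflated between-imputation component), which is exactly the variance calculation the abstract reports as underestimated for NN MI under ABB.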

Read the full article ›

Posted in: Journal Article Abstracts on 01/17/2022 | Link to this post on IFP |

© 1993-2023 Dr. Gary Holden. All rights reserved.

gary.holden@nyu.edu
@Info4Practice