Missingness in categorical data is a common problem in many real applications. Traditional approaches either use only the complete observations or impute the missing data by ad hoc methods rather than from the true conditional distribution of the missing data, thus losing or distorting the rich information in the partial observations. In this paper, we develop a Bayesian nonparametric approach, the Dirichlet Process Mixture of Collapsed Product-Multinomials (DPMCPM), to model the full data jointly and fit the model efficiently. By fitting an infinite mixture of product-multinomial distributions, DPMCPM is applicable to any categorical data regardless of the true distribution, which may contain complex associations among variables. Under the framework of latent class analysis, we show that DPMCPM can model general missingness mechanisms by creating an extra category to denote missingness, which implicitly integrates out the missing part with respect to its true conditional distribution. Through simulation studies and a real application, we demonstrate that DPMCPM outperforms existing approaches on statistical inference and imputation for incomplete categorical data under various missingness mechanisms. DPMCPM is implemented as the R package MMDai, which is available from the Comprehensive R Archive Network at https://cran.r-project.org/web/packages/MMDai/index.html.
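
To make the idea of "integrating out the missing part" concrete, the following is a minimal Python sketch, not the DPMCPM algorithm and not the MMDai API. It assumes a finite mixture with two components (instead of a Dirichlet process), made-up parameters, and a hypothetical marker `MISSING = -1`. It only illustrates one property of product-multinomial mixtures: dropping the missing variables from each component's product gives the same likelihood as summing the complete-data likelihood over every possible value of those variables.

```python
import itertools
import numpy as np

MISSING = -1  # hypothetical marker for a missing entry (an assumption of this sketch)


def observed_likelihood(row, weights, probs):
    """Likelihood of one row under a finite mixture of product-multinomials.

    Entries equal to MISSING are skipped, i.e. the product within each
    component runs over the observed variables only.
    probs[j] has shape (K, d_j): category probabilities of variable j
    in each of the K components.
    """
    total = 0.0
    for k, w in enumerate(weights):
        term = w
        for j, x in enumerate(row):
            if x != MISSING:
                term *= probs[j][k, x]
        total += term
    return total


def marginalized_likelihood(row, weights, probs):
    """Same quantity by brute force: sum over all completions of the row."""
    missing_idx = [j for j, x in enumerate(row) if x == MISSING]
    ranges = [range(probs[j].shape[1]) for j in missing_idx]
    total = 0.0
    for fill in itertools.product(*ranges):
        full = list(row)
        for j, v in zip(missing_idx, fill):
            full[j] = v
        total += observed_likelihood(full, weights, probs)
    return total


# Illustrative parameters only: K = 2 components, three variables
# with 2, 3 and 4 categories respectively.
rng = np.random.default_rng(0)
weights = np.array([0.6, 0.4])
levels = [2, 3, 4]
probs = [rng.dirichlet(np.ones(d), size=len(weights)) for d in levels]

row = [1, MISSING, 2]  # the second variable is unobserved
lik_skip = observed_likelihood(row, weights, probs)
lik_sum = marginalized_likelihood(row, weights, probs)
print(np.isclose(lik_skip, lik_sum))
```

The two functions agree because, within a component, the variables are independent and each multinomial sums to one over its categories. Dependence among variables enters only through the mixture over components, which is why a partially observed row still carries information about the unobserved variables.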