Multiple imputation for missing data and statistical disclosure control for mixed-mode data using a sequence of generalised linear models

Abstract

Multiple imputation is a commonly used approach to deal with missing data and to protect the confidentiality of public use data sets. The basic idea is to replace the missing or sensitive values with multiple imputations and then release the multiply imputed data sets to the public. Users can analyze the multiply imputed data sets and obtain valid inferences by using simple combining rules, which take into account the uncertainty due to the presence of missing values and synthetic values. It is crucial that imputations are drawn from the posterior predictive distribution, to preserve the relationships present in the data and to allow valid conclusions to be drawn from any analysis. In data sets with different types of variables, e.g. some categorical and some continuous, multivariate imputation by chained equations (MICE; Van Buuren, 2011) is a commonly used multiple imputation method. However, imputations from such an approach are not necessarily drawn from a proper posterior predictive distribution. We propose a method, called the factored regression model (FRM), to multiply impute missing values in such data sets by modelling the joint distribution of the variables in the data through a sequence of generalised linear models. We use data augmentation methods to connect the categorical and continuous variables, which allows us to draw imputations from a proper posterior distribution. We compare the performance of our method with MICE using simulation studies and a breastfeeding data set. We also extend our modelling strategies to incorporate different informative priors for the FRM, to explore robust regression modelling and sparse relationships between the predictors. We then apply our model to protect the confidentiality of the Current Population Survey (CPS) data by generating multiply imputed, partially synthetic data sets. These data sets comprise a mix of original and synthetic data, where the values chosen for synthesis are selected by an approach that considers unique and sensitive units in the survey. Valid inference can then be made using the combining rules described by Reiter (2003). An extension to the modelling strategy is also introduced to deal with the presence of spikes at zero in some of the continuous variables in the CPS data.
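
As an illustration of the combining rules mentioned in the abstract, the following is a minimal Python sketch, not code from the thesis. It assumes a scalar estimand and computes the standard total variance for multiply imputed missing data (Rubin's rules) and the variant for partially synthetic data described by Reiter (2003). The function name `combine_estimates` and the input numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean, variance


def combine_estimates(point_estimates, variance_estimates, partially_synthetic=False):
    """Combine m completed-data estimates of a scalar quantity.

    point_estimates: the estimate from each of the m data sets.
    variance_estimates: the estimated variance from each of the m data sets.
    Returns (combined point estimate, total variance).
    """
    m = len(point_estimates)
    if m < 2 or len(variance_estimates) != m:
        raise ValueError("need at least two data sets and matching lengths")

    q_bar = mean(point_estimates)      # combined point estimate
    u_bar = mean(variance_estimates)   # average within-data-set variance
    b = variance(point_estimates)      # between-data-set variance (divisor m - 1)

    if partially_synthetic:
        # Partially synthetic data (Reiter, 2003): T = u_bar + b / m
        total = u_bar + b / m
    else:
        # Missing data (Rubin's rules): T = u_bar + (1 + 1/m) * b
        total = u_bar + (1 + 1 / m) * b
    return q_bar, total


# Hypothetical estimates from m = 3 completed data sets.
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]

q_missing, t_missing = combine_estimates(estimates, variances)
q_synth, t_synth = combine_estimates(estimates, variances, partially_synthetic=True)
```

The point estimate is the same in both settings; only the weight given to the between-data-set variance differs. Interval estimates would additionally need the reference t-distribution degrees of freedom, which this sketch leaves out.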

Bibliographic details

  • Author: Lee Min Cherng
  • Year: 2014
  • Original format: PDF
  • Language: English
