Journal: Statistica Sinica

CLASSIFICATION AND REGRESSION TREES AND FORESTS FOR INCOMPLETE DATA FROM SAMPLE SURVEYS

Abstract

Analysis of sample survey data often requires adjustments for missing values in the variables of interest. Standard adjustments based on item imputation or on propensity weighting factors rely on the availability of auxiliary variables for both responding and non-responding units. Their application can be challenging when the auxiliary variables are numerous and are themselves subject to incomplete-data problems. This paper shows how classification and regression trees and forests can overcome these difficulties and compares them with likelihood methods in terms of bias and mean squared error. The development centers on a component of income data from the U.S. Consumer Expenditure Survey, which has a relatively high rate of item missingness. Classification trees and forests are used to model the unit-level propensity for item missingness in the income component. Regression trees and forests are used to model the conditional mean of the income component. The methods are then used to estimate the mean of the income component, adjusted for item nonresponse. Thirteen methods for estimating a population mean are compared in simulation experiments. The results show that if the number of auxiliary variables with missing values is not small, or if they have substantial missingness rates, likelihood methods can be impracticable or inapplicable. Tree and forest methods are always applicable, are relatively fast, and have higher efficiency than likelihood methods under real-data situations with incomplete-data patterns similar to that in the abovementioned survey. Their efficiency loss under parametric conditions most favorable to likelihood methods is observed to be between 10% and 25%.
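The two adjustments named in the abstract, propensity weighting and imputation from a conditional-mean model, can be illustrated with a toy sketch. The code below is not the paper's procedure: it assumes a single complete auxiliary variable, replaces the trees and forests with a hand-written depth-one tree (one split chosen by Gini impurity on the response indicator), and all names (`best_split`, `adjusted_means`, the toy data) are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(x, r):
    """Threshold on x that minimizes the weighted Gini impurity of the
    response indicator r (a depth-one classification tree)."""
    values = sorted(set(x))
    best_score, best_t = None, None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(r)
        if best_score is None or score < best_score:
            best_score, best_t = score, t
    if best_t is None:
        raise ValueError("x needs at least two distinct values")
    return best_t


def adjusted_means(x, y, r):
    """Estimate the mean of y when y[i] is missing (None) wherever r[i] == 0.

    Returns the split threshold, an inverse-propensity-weighted mean, and
    an imputation-based mean, both computed within the two leaves.
    """
    t = best_split(x, r)
    n = len(x)
    weighted_sum = weight_total = completed_sum = 0.0
    for in_right in (False, True):
        idx = [i for i in range(n) if (x[i] > t) == in_right]
        resp = [i for i in idx if r[i] == 1]
        if not resp:
            raise ValueError("a leaf has no respondents")
        propensity = len(resp) / len(idx)            # leaf response rate
        leaf_mean = sum(y[i] for i in resp) / len(resp)
        # Weighting: each respondent counts 1 / propensity times.
        weighted_sum += sum(y[i] / propensity for i in resp)
        weight_total += len(resp) / propensity
        # Imputation: nonrespondents receive the leaf mean of respondents.
        completed_sum += sum(y[i] for i in resp)
        completed_sum += (len(idx) - len(resp)) * leaf_mean
    return {
        "threshold": t,
        "ipw": weighted_sum / weight_total,
        "imputed": completed_sum / n,
    }


# Toy data (invented for illustration): units with large x respond less
# often and have larger y, so the respondent-only mean is biased downward.
x = [1, 2, 3, 4, 5, 6, 7, 8]
r = [1, 1, 1, 1, 0, 1, 0, 0]
y = [10, 12, 11, 13, None, 30, None, None]
result = adjusted_means(x, y, r)
print(result)
```

On this toy data the respondent-only mean is 15.2, while both adjusted estimates equal 20.75. The two coincide here only because the sketch uses the same leaves for the propensity and the conditional mean; separately fitted trees or forests, as described in the abstract, would generally give different estimates.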
