Conference: Gesellschaft fur Klassifikation

Data Preparation in Large Real-World Data Mining Projects: Methods for Imputing Missing Values



Abstract

One of the most important aspects of data preprocessing for data mining concerns the handling and imputation of missing values. While differences in the performance of state-of-the-art algorithms on the same dataset usually remain rather small, the quality of missing value handling can have dramatic consequences and is often crucial for the success of the subsequent model building. This paper explores the consequences of two major missing value replacement strategies (replace-with-mean and multivariate regression) for the performance of classification models. Using a complete real-world dataset for a binary classification problem (churn in financial services), the hit rates of different data mining algorithms are first benchmarked for the case in which no missing values are present. Then, different missing value patterns (MCAR, MAR and IM) are simulated by deleting predictor values from the training samples according to those patterns. Next, the two imputation strategies (replace-with-mean and regression) are used to recreate complete training datasets on which classification models are built. Finally, the hit rates of the models are determined on hold-out test sets (the original complete data, not imputed), and the performances of the models are compared. The results clearly show that the regression strategy far outperforms the simpler replace-with-mean imputation by introducing much less artificial bias into the data, thus enabling better models to be built. They underline the performance advantages of more complex and time-consuming multivariate imputation schemes over the straightforward replace-with-mean techniques unfortunately implemented in many commercial data mining packages.
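The contrast between the two imputation strategies can be illustrated with a small sketch. This is not the paper's experiment: the data are synthetic (two correlated Gaussian variables), only the MCAR pattern is simulated, a single-predictor least-squares fit stands in for multivariate regression, and the comparison is on imputation error rather than classifier hit rates. All function and variable names are hypothetical.

```python
import random

def simulate_mcar(values, rate, rng):
    """Delete entries completely at random (MCAR): each value is dropped with the same probability."""
    return [None if rng.random() < rate else v for v in values]

def impute_mean(column):
    """Replace-with-mean: fill every missing entry with the mean of the observed ones."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def impute_regression(column, predictor):
    """Regression imputation: fit y = a + b*x on complete pairs, then predict the missing y."""
    pairs = [(x, y) for x, y in zip(predictor, column) if y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None else y
            for x, y in zip(predictor, column)]

def rmse_on_missing(truth, masked, imputed):
    """Root-mean-square error, evaluated only where a value was deleted."""
    errors = [(t - i) ** 2 for t, m, i in zip(truth, masked, imputed) if m is None]
    return (sum(errors) / len(errors)) ** 0.5

# Synthetic data (an assumption for illustration): x2 depends linearly on x1 plus noise.
rng = random.Random(0)
x1 = [rng.gauss(0.0, 1.0) for _ in range(2000)]
x2 = [2.0 * a + rng.gauss(0.0, 0.5) for a in x1]

# Delete roughly 30% of x2, then rebuild it with each strategy.
x2_masked = simulate_mcar(x2, 0.3, rng)
rmse_mean = rmse_on_missing(x2, x2_masked, impute_mean(x2_masked))
rmse_reg = rmse_on_missing(x2, x2_masked, impute_regression(x2_masked, x1))

print(f"replace-with-mean RMSE: {rmse_mean:.3f}")
print(f"regression RMSE:        {rmse_reg:.3f}")
```

When the deleted variable is correlated with an observed predictor, as in this toy setup, the regression fill recovers the deleted values more closely than a constant mean. Whether that carries over to classifier hit rates on a real dataset is what the paper evaluates.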
