Pattern Recognition: The Journal of the Pattern Recognition Society

Impact of imputation of missing values on classification error for discrete data



Abstract

Numerous industrial and research databases include missing values. It is not uncommon to encounter databases that have up to half of the entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data. A common way of dealing with this problem is to impute (fill in) the missing values. This paper evaluates how the choice of different imputation methods affects the performance of classifiers that are subsequently used with the imputed data. The experiments here focus on discrete data. This paper studies the effect of missing data imputation using five single imputation methods (a mean method, a hot deck method, a Naive-Bayes method, and the latter two methods with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on classification accuracy for six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naive-Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy when compared to classification without imputation. Although the results show that there is no universally best imputation method, Naive-Bayes imputation is shown to give the best results for the RIPPER classifier for datasets with a high amount (i.e., 40% and 50%) of missing data, polytomous regression imputation is shown to be the best for the support vector machine classifier with polynomial kernel, and the application of the imputation framework is shown to be superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of the imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for the mean imputation, reduce classification error for data with more than 10% of missing data. Finally, some classifiers such as C4.5 and Naive-Bayes were found to be missing-data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers such as K-nearest-neighbor, SVMs and RIPPER benefit from the imputation. (C) 2008 Elsevier Ltd. All rights reserved.
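To make the contrast between single-imputation strategies concrete, the following is a minimal Python sketch, not the paper's actual pipeline: the toy dataset, the 30% missingness rate, and the choice of a Naive-Bayes classifier are illustrative assumptions. It compares mode imputation (the discrete analogue of mean imputation) with a simple random hot-deck imputation before classification.

```python
# Illustrative sketch only: mode vs. random hot-deck imputation on a toy
# discrete dataset, followed by Naive-Bayes classification. Data, rates,
# and classifier choice are assumptions, not the paper's exact setup.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)

# Toy discrete dataset: 3 categorical features encoded as small integers.
n = 500
X = rng.integers(0, 4, size=(n, 3)).astype(float)
y = (X[:, 0] + X[:, 1] > 3).astype(int)

# Inject 30% missing values completely at random.
mask = rng.random(X.shape) < 0.30
X_miss = X.copy()
X_miss[mask] = np.nan

def hot_deck_impute(X_in, rng):
    """Fill each missing cell with a value drawn at random from the
    observed (donor) values of the same column: a simple random hot deck."""
    X_out = X_in.copy()
    for j in range(X_out.shape[1]):
        col = X_out[:, j]
        missing = np.isnan(col)
        col[missing] = rng.choice(col[~missing], size=missing.sum())
    return X_out

# Mode imputation: the discrete counterpart of mean imputation.
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X_miss)
# Hot-deck imputation.
X_hot = hot_deck_impute(X_miss, rng)

for name, X_imp in [("mode", X_mode), ("hot deck", X_hot)]:
    X_imp = X_imp.astype(int)  # CategoricalNB expects integer-coded categories
    X_tr, X_te, y_tr, y_te = train_test_split(X_imp, y, random_state=0)
    acc = CategoricalNB().fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name} imputation -> Naive-Bayes accuracy: {acc:.3f}")
```

The same pattern generalizes to the study's comparison: hold the classifier fixed, vary only the imputation step, and measure the resulting classification accuracy across missingness levels.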
