Pattern Recognition: The Journal of the Pattern Recognition Society

Impact of imputation of missing values on classification error for discrete data



Abstract

Numerous industrial and research databases include missing values. It is not uncommon to encounter databases that have up to half of the entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data. A common way of dealing with this problem is to impute (fill in) the missing values. This paper evaluates how the choice of different imputation methods affects the performance of classifiers that are subsequently used with the imputed data. The experiments here focus on discrete data. This paper studies the effect of missing data imputation using five single imputation methods (a mean method, a hot deck method, a Naive-Bayes method, and the latter two methods with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on classification accuracy for six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naive-Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy when compared to classification without imputation. Although the results show that there is no universally best imputation method, Naive-Bayes imputation is shown to give the best results for the RIPPER classifier for datasets with a high amount (i.e., 40% and 50%) of missing data, polytomous regression imputation is shown to be the best for the support vector machine classifier with polynomial kernel, and the application of the imputation framework is shown to be superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of the imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for the mean imputation, reduce classification error for data with more than 10% of missing data. Finally, some classifiers such as C4.5 and Naive-Bayes were found to be missing-data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers such as K-nearest-neighbor, SVMs and RIPPER benefit from the imputation. (C) 2008 Elsevier Ltd. All rights reserved.
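To make the contrast between single-imputation strategies concrete, the following is a minimal Python sketch, not the paper's actual pipeline: the toy dataset, the 30% missingness rate, and the choice of a Naive-Bayes classifier are illustrative assumptions. It compares mode imputation (the discrete analogue of mean imputation) with a simple random hot-deck imputation before classification.

```python
# Illustrative sketch only: mode vs. random hot-deck imputation on a toy
# discrete dataset, followed by Naive-Bayes classification. Data, rates,
# and classifier choice are assumptions, not the paper's exact setup.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)

# Toy discrete dataset: 3 categorical features encoded as small integers.
n = 500
X = rng.integers(0, 4, size=(n, 3)).astype(float)
y = (X[:, 0] + X[:, 1] > 3).astype(int)

# Inject 30% missing values completely at random.
mask = rng.random(X.shape) < 0.30
X_miss = X.copy()
X_miss[mask] = np.nan

def hot_deck_impute(X_in, rng):
    """Fill each missing cell with a value drawn at random from the
    observed (donor) values of the same column: a simple random hot deck."""
    X_out = X_in.copy()
    for j in range(X_out.shape[1]):
        col = X_out[:, j]
        missing = np.isnan(col)
        col[missing] = rng.choice(col[~missing], size=missing.sum())
    return X_out

# Mode imputation: the discrete counterpart of mean imputation.
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X_miss)
# Hot-deck imputation.
X_hot = hot_deck_impute(X_miss, rng)

for name, X_imp in [("mode", X_mode), ("hot deck", X_hot)]:
    X_imp = X_imp.astype(int)  # CategoricalNB expects integer-coded categories
    X_tr, X_te, y_tr, y_te = train_test_split(X_imp, y, random_state=0)
    acc = CategoricalNB().fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name} imputation -> Naive-Bayes accuracy: {acc:.3f}")
```

The same pattern generalizes to the study's comparison: hold the classifier fixed, vary only the imputation step, and measure the resulting classification accuracy across missingness levels.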
