A Comparison of Strategies for Missing Values in Data on Machine Learning Classification Algorithms

Abstract

Handling missing values is an important feature engineering task in data science, because missing values can degrade the predictive accuracy of machine learning classification models. However, in real-life data it is often unclear what causes the missing values, that is, which missing data mechanism underlies the missingness. It therefore becomes necessary to evaluate several missing data approaches for a given dataset. In this paper, we perform a comparative study of several approaches for handling missing values: listwise deletion, mean imputation, mode imputation, k-nearest neighbors, expectation-maximization, and multiple imputation by chained equations. The comparison is performed on two real-world datasets using the following evaluation metrics: accuracy, root mean squared error, receiver operating characteristic, and F1 score. Most classifiers performed well across the missing data strategies. However, based on the results obtained, the support vector classifier performed marginally better overall on the numerical data, and the naïve Bayes classifier on the categorical data, when compared with the other evaluated methods.
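To make the simpler strategies named in the abstract concrete, the following is a minimal sketch of listwise deletion, mean imputation, mode imputation, and k-nearest-neighbor imputation on a toy table. It is not the authors' implementation: the function names, the encoding of missing entries as `None`, the toy rows, and the choice of Euclidean distance with k = 2 are all assumptions made for illustration. Expectation-maximization and multiple imputation by chained equations are left out because they require iterative model fitting.

```python
import math
from collections import Counter
from statistics import mean

MISSING = None  # assumption: missing entries are encoded as None


def listwise_deletion(rows):
    """Drop every row that contains at least one missing value."""
    return [list(row) for row in rows if all(v is not MISSING for v in row)]


def _fill_column(rows, col, fill):
    """Return a copy of rows where missing entries in `col` are set to `fill`."""
    return [
        [fill if (i == col and v is MISSING) else v for i, v in enumerate(row)]
        for row in rows
    ]


def mean_impute(rows, col):
    """Replace missing entries in a numeric column with the observed mean."""
    observed = [row[col] for row in rows if row[col] is not MISSING]
    return _fill_column(rows, col, mean(observed))


def mode_impute(rows, col):
    """Replace missing entries in a categorical column with the most frequent value."""
    observed = [row[col] for row in rows if row[col] is not MISSING]
    return _fill_column(rows, col, Counter(observed).most_common(1)[0][0])


def knn_impute(rows, col, feature_cols, k=2):
    """Fill a numeric column with the mean of the k nearest donors.

    Distance is Euclidean over `feature_cols`, which are assumed to be
    numeric and fully observed (a simplification for this sketch).
    """
    donors = [row for row in rows if row[col] is not MISSING]
    result = []
    for row in rows:
        new_row = list(row)
        if row[col] is MISSING:
            point = [row[c] for c in feature_cols]
            nearest = sorted(
                donors,
                key=lambda d: math.dist(point, [d[c] for c in feature_cols]),
            )[:k]
            new_row[col] = mean(d[col] for d in nearest)
        result.append(new_row)
    return result


# Hypothetical toy data: two numeric columns and one categorical column.
rows = [
    [1.0, 10.0, "a"],
    [2.0, None, "b"],
    [3.0, 30.0, None],
    [4.0, 40.0, "b"],
]

deleted = listwise_deletion(rows)
mean_filled = mean_impute(rows, col=1)
mode_filled = mode_impute(rows, col=2)
knn_filled = knn_impute(rows, col=1, feature_cols=[0], k=2)
```

Listwise deletion shrinks the table to the two complete rows, while the three imputation functions keep all four rows and change only the missing cell. That trade-off between sample size and possible bias is the reason a comparison across strategies is useful.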

Bibliographic details

Source
International Multidisciplinary Information Technology and Engineering Conference | 2019 | pp. 1-7 | 7 pages
Conference location: Vanderbijlpark (ZA)
Authors
Tebogo Makaba; Eustace Dogo
Author affiliations

University of Johannesburg Department of Applied Information Systems Johannesburg South Africa;

University of Johannesburg Department of Electrical and Electronic Engineering Johannesburg South Africa;

Original format: PDF
Keywords
Measurement; Mice; Classification algorithms; Support vector machines; Data models; Radio frequency; Machine learning;
