Conference: International Multidisciplinary Information Technology and Engineering Conference

A Comparison of Strategies for Missing Values in Data on Machine Learning Classification Algorithms

Abstract

Handling missing values is an important feature engineering task in data science, because missing values can degrade the predictive accuracy of machine learning classification models. In real-life data, however, the missing data mechanism, that is, the underlying cause of the missingness, is often unclear. It therefore becomes necessary to evaluate several missing data approaches for a given dataset. In this paper, we perform a comparative study of several approaches for handling missing values: listwise deletion, mean imputation, mode imputation, k-nearest neighbors, expectation-maximization, and multiple imputation by chained equations. The comparison is performed on two real-world datasets using the following evaluation metrics: accuracy, root mean squared error, receiver operating characteristic, and the F1 score. Most classifiers performed well across the missing data strategies. However, based on the results obtained, the support vector classifier performed marginally better overall on the numerical data, and the naïve Bayes classifier on the categorical data, across the evaluated missing value methods.
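To make the simpler strategies named in the abstract concrete, the following is a minimal sketch of listwise deletion, mean imputation, mode imputation, and k-nearest-neighbors imputation using only the Python standard library. It is not the authors' implementation: the toy tables `ROWS` and `COLORS`, the helper names, and the choice of `k=2` are all assumptions made for illustration. Expectation-maximization, multiple imputation by chained equations, the classifiers, and the evaluation metrics are left out.

```python
from math import sqrt
from statistics import mean, mode

# Hypothetical toy data (not from the paper); None marks a missing value.
ROWS = [
    [1.0, 2.0],
    [2.0, None],
    [3.0, 6.0],
    [None, 8.0],
    [5.0, 10.0],
]
COLORS = [["red"], ["blue"], [None], ["red"]]


def listwise_deletion(rows):
    """Drop every row that contains at least one missing value."""
    return [row[:] for row in rows if None not in row]


def column_fill(rows, stat):
    """Replace each missing value with a statistic (e.g. mean or mode)
    computed over the observed values of the same column."""
    fills = [stat([v for v in col if v is not None]) for col in zip(*rows)]
    return [
        [fills[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


def knn_impute(rows, k=2):
    """Fill each missing value with the column mean over the k nearest
    complete rows. Distance is Euclidean over the features that the
    incomplete row actually has."""
    complete = listwise_deletion(rows)
    result = []
    for row in rows:
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(other):
            return sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=distance)[:k]
        result.append([
            mean(n[j] for n in neighbours) if v is None else v
            for j, v in enumerate(row)
        ])
    return result


if __name__ == "__main__":
    print("listwise:", listwise_deletion(ROWS))
    print("mean:    ", column_fill(ROWS, mean))
    print("mode:    ", column_fill(COLORS, mode))
    print("knn:     ", knn_impute(ROWS, k=2))
```

Listwise deletion shrinks the dataset, while the three imputation helpers keep every row; which trade-off is preferable depends on the missing data mechanism, which is why the abstract argues for comparing several strategies on each dataset.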
