Conference on Intelligent Computing: Theory and Applications II; April 12-13, 2004; Orlando, FL, US

Experimental analysis of methods for imputation of missing values in databases

Abstract

An important issue faced by researchers and practitioners who use industrial and research databases is incompleteness of data, usually in the form of missing or erroneous values. While some data analysis algorithms can work with incomplete data, a large portion of them require complete data. Therefore, different strategies, such as deletion of incomplete examples and imputation (filling) of missing values through a variety of statistical and machine learning (ML) procedures, have been developed to preprocess incomplete data. This study presents an experimental analysis of several algorithms for imputation of missing values, ranging from simple statistical algorithms, such as mean and hot deck imputation, to imputation algorithms based on inductive ML algorithms. Three major families of ML algorithms, namely probabilistic algorithms (e.g. Naive Bayes), decision tree algorithms (e.g. C4.5), and decision rule algorithms (e.g. CLIP4), are used to implement the ML-based imputation algorithms. The analysis is carried out on a comprehensive range of databases into which missing values were introduced randomly. The goal of this paper is to provide general guidelines for selecting suitable data imputation algorithms based on characteristics of the data. The guidelines are developed through a comprehensive experimental comparison of the performance of different data imputation algorithms.
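To make the two simple statistical baselines named in the abstract concrete, the following is a minimal Python sketch of mean imputation and one possible form of hot deck imputation, plus a helper that introduces missing values at random. The abstract does not specify the paper's exact procedures, so everything here is an assumption for illustration: the function names (`introduce_missing`, `mean_impute`, `hot_deck_impute`) are hypothetical, missing cells are represented as `None`, all attributes are treated as numeric, and the hot deck variant shown picks the nearest donor row by mean squared difference over shared attributes.

```python
import random
from statistics import mean


def introduce_missing(rows, fraction, rng):
    """Return a copy of rows with roughly `fraction` of cells set to None at random."""
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    k = int(round(fraction * len(cells)))
    masked = [list(row) for row in rows]
    for i, j in rng.sample(cells, k):
        masked[i][j] = None
    return masked


def mean_impute(rows):
    """Fill each missing cell with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # A column with no observed values cannot be imputed and stays None.
        col_means.append(mean(observed) if observed else None)
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def hot_deck_impute(rows):
    """Fill each missing cell with the value from the most similar donor row.

    Assumed variant: similarity is the mean squared difference over the
    attributes that both rows have observed.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sum((x - y) ** 2 for x, y in shared) / len(shared)

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                donors = [r for k, r in enumerate(rows)
                          if k != i and r[j] is not None]
                if donors:
                    best = min(donors, key=lambda r: distance(row, r))
                    filled[i][j] = best[j]
    return filled


# Toy data, made up for illustration only.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [None, 40.0]]
print(mean_impute(rows))
print(hot_deck_impute(rows))
```

Mean imputation ignores relationships between attributes, while this hot deck variant copies a value from a similar record, so the two can give different fills for the same cell. The ML-based approaches mentioned in the abstract would instead train a model to predict the missing attribute from the others.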