Data Preparation in Large Real-World Data Mining Projects: Methods for Imputing Missing Values

Abstract

One of the most important aspects of data preprocessing for data mining is the handling and imputation of missing values. While differences in the performance of different state-of-the-art algorithms on the same dataset usually remain rather small, the quality of missing value handling can have dramatic consequences and is often crucial for the success of the subsequent model building. This paper explores the consequences of two major missing value replacement strategies (replace-with-mean and multivariate regression) for the performance of classification models. Using a complete real-world dataset for a binary classification problem (churn in financial services), the hit rates of different data mining algorithms are first benchmarked for the case in which no missing values are present. Then, different missing value patterns (MCAR, MAR and IM) are simulated by deleting predictor values from the training samples according to those patterns. After this, the two imputation strategies (replace-with-mean and regression) are used to recreate complete training datasets on which classification models are built. Finally, the hit rates of the models are determined on hold-out test sets (the original complete data, not imputed), and the performances of the models are compared. The results clearly show that the regression strategy far outperforms the simpler replace-with-mean imputation: it introduces much less artificial bias into the data and thus enables better models to be built. The results underline the performance advantages of more complex and time-consuming multivariate imputation schemes over the straightforward replace-with-mean techniques that are unfortunately implemented in many commercial data mining packages.
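The delete-then-impute procedure described in the abstract can be illustrated with a small sketch. This is not the paper's experiment: it assumes synthetic two-column data instead of the churn dataset, a single-predictor linear regression instead of multivariate regression, and it measures reconstruction error (RMSE on the deleted values) rather than classifier hit rates. Only the MCAR (missing completely at random) and MAR (missing at random) patterns are simulated; the deletion probabilities and all function names are hypothetical choices made for illustration.

```python
import math
import random


def make_data(n, rng):
    """Synthetic data (an assumption): x2 depends linearly on x1 plus noise."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = 2.0 * x1 + rng.gauss(0.0, 0.5)
        rows.append((x1, x2))
    return rows


def delete_values(rows, pattern, rng):
    """Return the x2 column with some entries replaced by None.

    MCAR: every value is deleted with the same probability.
    MAR:  the deletion probability depends only on the observed x1.
    """
    masked = []
    for x1, x2 in rows:
        if pattern == "MCAR":
            p = 0.3
        elif pattern == "MAR":
            p = 0.5 if x1 > 0 else 0.1
        else:
            raise ValueError(f"unknown pattern: {pattern}")
        masked.append(None if rng.random() < p else x2)
    return masked


def impute_mean(masked):
    """Replace every missing x2 with the mean of the observed x2 values."""
    observed = [v for v in masked if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in masked]


def impute_regression(rows, masked):
    """Fit x2 = a + b * x1 on the complete cases, then predict missing x2."""
    pairs = [(x1, v) for (x1, _), v in zip(rows, masked) if v is not None]
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    b = sxy / sxx
    a = my - b * mx
    return [a + b * x1 if v is None else v for (x1, _), v in zip(rows, masked)]


def rmse_on_missing(rows, masked, imputed):
    """Root mean squared error between imputed and true x2, deleted cells only."""
    errs = [
        (imp - x2) ** 2
        for (_, x2), v, imp in zip(rows, masked, imputed)
        if v is None
    ]
    return math.sqrt(sum(errs) / len(errs))


def run_experiment(seed=0, n=2000):
    """Compare both imputation strategies under both missingness patterns."""
    rng = random.Random(seed)
    rows = make_data(n, rng)
    results = {}
    for pattern in ("MCAR", "MAR"):
        masked = delete_values(rows, pattern, rng)
        results[pattern] = {
            "mean": rmse_on_missing(rows, masked, impute_mean(masked)),
            "regression": rmse_on_missing(
                rows, masked, impute_regression(rows, masked)
            ),
        }
    return results


if __name__ == "__main__":
    for pattern, scores in run_experiment().items():
        print(pattern, scores)
```

Because x2 is constructed to depend on x1, the regression imputation recovers the deleted values more closely than the column mean in this toy setting. How large the gap is on real data, and how it carries over to classification hit rates, is what the paper itself investigates.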


Bibliographic record

Source: Gesellschaft fur Klassifikation, 2003, 9 pages
Author: Th. Liehr
Original format: PDF
Chinese Library Classification: O212-532

