Systematic Development of Data Mining-Based Data Quality Tools

Abstract

Data quality problems have been a persistent concern, especially for large, historically grown databases. If such databases are maintained over long periods, the interpretation and usage of their schemas often shift. Therefore, traditional data scrubbing techniques based on existing schema and integrity constraint documentation are hardly applicable. So-called data auditing environments circumvent this problem by using machine learning techniques to induce semantically meaningful structures from the actual data and then classifying outliers that do not fit the induced schema as potential errors. However, as the quality of the analyzed database is unknown a priori, the design of data auditing environments requires special methods for calibrating error measurements based on the induced schema. In this paper, we present a data audit test generator that systematically generates and pollutes artificial benchmark databases for this purpose. The test generator has been implemented as part of a data auditing environment based on the well-known machine learning algorithm C4.5. Validation in the partial quality audit of a large service-related database at Daimler-Chrysler shows the usefulness of the approach as a complement to standard data scrubbing.
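
The generate, pollute, and audit loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' test generator: all function names, the `model`/`engine` attributes, and the 5% pollution rate are hypothetical, and a simple majority-value lookup stands in for C4.5. Because the injected errors are known, the flagged rows can be scored against them, which is the calibration idea the abstract refers to.

```python
import random
from collections import Counter, defaultdict


def generate_clean(n, rng):
    """Build an artificial table in which 'engine' is fully determined by 'model'."""
    rule = {"A": "petrol", "B": "diesel", "C": "electric"}  # hypothetical schema rule
    rows = []
    for _ in range(n):
        model = rng.choice(sorted(rule))
        rows.append({"model": model, "engine": rule[model]})
    return rows


def pollute(rows, rate, rng):
    """Return a polluted copy and the set of row indices that were corrupted."""
    values = sorted({r["engine"] for r in rows})
    polluted, bad = [], set()
    for i, row in enumerate(rows):
        row = dict(row)
        if rng.random() < rate:
            # Swap in a different, still valid-looking value.
            row["engine"] = rng.choice([v for v in values if v != row["engine"]])
            bad.add(i)
        polluted.append(row)
    return polluted, bad


def audit(rows):
    """Induce 'model -> most frequent engine' from the data and flag deviating rows.

    This majority lookup is a stand-in for a learned classifier such as C4.5.
    """
    counts = defaultdict(Counter)
    for r in rows:
        counts[r["model"]][r["engine"]] += 1
    induced = {m: c.most_common(1)[0][0] for m, c in counts.items()}
    return {i for i, r in enumerate(rows) if r["engine"] != induced[r["model"]]}


rng = random.Random(0)
clean = generate_clean(1000, rng)
polluted, injected = pollute(clean, 0.05, rng)
flagged = audit(polluted)

# Score the audit against the known injected errors.
true_pos = len(flagged & injected)
precision = true_pos / len(flagged) if flagged else 1.0
recall = true_pos / len(injected) if injected else 1.0
print(f"injected={len(injected)} flagged={len(flagged)} "
      f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting the dependency is deterministic, so the audit recovers the injected errors cleanly. With noisy or only partially dependent attributes, precision and recall would drop, and measuring that drop on benchmark data is the point of such a test generator.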

Bibliographic record

Source
Twenty-ninth International Conference on Very Large Databases; Sep 9-12, 2003; Berlin, Germany | 2003 | pp. 548-559 | 12 pages
Conference location: Berlin (DE)
Authors
Dominik Luebbers; Udo Grimmer; Matthias Jarke
Author affiliation
RWTH Aachen, Informatik V (Information Systems), Ahornstr. 55, 52056 Aachen, Germany
Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology
Date indexed: 2022-08-26 14:15:36

Similar literature

1. Data mining-based model and risk prediction of colorectal cancer by using secondary health data: A systematic review [J]. Hailun Liang, Lei Yang, Lei Tao. Chinese Journal of Cancer Research (中国癌症研究（英文版）). 2020, No. 2
2. Assessing the perceived quality of brachial artery Flow Mediated Dilation studies for inclusion in meta-analyses and systematic reviews: Description of data employed in the development of a scoring tool based on currently accepted guidelines [J]. Arno Greyling, Anke C.C.M van Mil, Peter L. Zock. Data in Brief. 2016, No. 1
3. Systematic data mining-based framework to discover potential energy waste patterns in residential buildings [J]. Li Jun, Panchabikesan Karthik, Yu Jerry Zhun. Energy and Buildings. 2019, No. SEPa
4. Systematic Development of Data Mining-Based Data Quality Tools [C]. Dominik Luebbers, Udo Grimmer, Matthias Jarke. International Conference on Very Large Databases. 2003
5. Data mining-based inhabitant action predictor for smart homes using controlled synthetic data [D]. Pundi, Varadharajan Sridhar. 2008
6. Data mining-based model and risk prediction of colorectal cancer by using secondary health data: A systematic review [O]. Hailun Liang, Lei Yang, Lei Tao. 2020
7. Systematic development of data mining-based data quality tools [O]. Lübbers Dominik, Grimmer Udo, Jarke Matthias. 2003
