
Systematic Development of Data Mining-Based Data Quality Tools



Abstract

Data quality problems have been a persistent concern, especially for large, historically grown databases. If such databases are maintained over long periods, the interpretation and usage of their schemas often shift. Therefore, traditional data scrubbing techniques based on existing schema and integrity constraint documentation are hardly applicable. So-called data auditing environments circumvent this problem by using machine learning techniques to induce semantically meaningful structures from the actual data and then classifying outliers that do not fit the induced schema as potential errors. However, as the quality of the analyzed database is unknown a priori, the design of data auditing environments requires special methods for calibrating error measurements based on the induced schema. In this paper, we present a data audit test generator that systematically generates and pollutes artificial benchmark databases for this purpose. The test generator has been implemented as part of a data auditing environment based on the well-known machine learning algorithm C4.5. Validation in the partial quality audit of a large service-related database at Daimler-Chrysler shows the usefulness of the approach as a complement to standard data scrubbing.
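The generate, pollute, and audit cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' test generator: the function names, the two-column schema (`model`, `engine`), and the pollution rate are all hypothetical, and a simple majority-value rule stands in for the C4.5-induced structure. Because the polluted cells are recorded, the auditor's flags can be scored against a known ground truth, which is the calibration idea the abstract refers to.

```python
import random
from collections import Counter, defaultdict

# Hypothetical schema: the "model" column fully determines the "engine" column.
RULE = {"A": "petrol", "B": "diesel", "C": "electric"}

def generate_clean(n):
    """Build an artificial benchmark table in which every record obeys RULE."""
    models = sorted(RULE)
    return [{"model": models[i % 3], "engine": RULE[models[i % 3]]} for i in range(n)]

def pollute(records, rate, rng):
    """Corrupt the engine field of a random sample; return the copy and the corrupted indices."""
    engines = sorted(set(RULE.values()))
    dirty = [dict(r) for r in records]
    bad = set(rng.sample(range(len(records)), int(len(records) * rate)))
    for i in bad:
        dirty[i]["engine"] = rng.choice([e for e in engines if e != dirty[i]["engine"]])
    return dirty, bad

def audit(records, min_confidence=0.8):
    """Induce the majority engine per model (a stand-in for C4.5) and flag deviations."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r["model"]][r["engine"]] += 1
    flagged = set()
    for i, r in enumerate(records):
        value, freq = counts[r["model"]].most_common(1)[0]
        confident = freq / sum(counts[r["model"]].values()) >= min_confidence
        if confident and r["engine"] != value:
            flagged.add(i)
    return flagged

def score(flagged, bad):
    """Precision and recall of the flagged set against the known polluted set."""
    tp = len(flagged & bad)
    precision = tp / len(flagged) if flagged else 1.0
    recall = tp / len(bad) if bad else 1.0
    return precision, recall

rng = random.Random(0)
clean = generate_clean(999)
dirty, bad = pollute(clean, 0.05, rng)
flagged = audit(dirty)
precision, recall = score(flagged, bad)
print(f"polluted={len(bad)} flagged={len(flagged)} precision={precision} recall={recall}")
```

In this toy setting the pollution rate is low enough that the induced majority rule stays correct for every model, so each polluted record is recovered. Raising `rate` or lowering `min_confidence` in the sketch is one way to explore how such measurements might be calibrated before an auditor is applied to a database of unknown quality.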
