Conference: International Conference on Very Large Data Bases

Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Abstract

We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
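
To make the idea of controlled error generation concrete, here is a minimal toy sketch in Python. It is not the algorithm from the paper; it only illustrates the general setting under simplifying assumptions: the data is a list of dictionaries, the only quality constraint is a single functional dependency (for example zip -> city), and an error is "detectable" when it makes two rows agree on the left-hand side but differ on the right-hand side. The function names `fd_violations` and `inject_fd_errors`, the `_err` suffix, and the sample data are all hypothetical. The sketch changes at most one cell per group of rows so that injected errors cannot interact and cancel each other out, and it may return fewer errors than requested when too few groups qualify.

```python
import random
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of row indices that agree on `lhs` but disagree on `rhs`."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[lhs]].append(i)
    return [idxs for idxs in groups.values()
            if len({rows[i][rhs] for i in idxs}) > 1]


def inject_fd_errors(rows, lhs, rhs, n_errors, seed=0):
    """Greedily dirty `rhs` cells so that each change violates lhs -> rhs.

    Toy assumptions (not from the paper): the input is clean with respect
    to the dependency, and at most one cell is changed per lhs-group, so
    two changes can never make a group consistent again by accident.
    Returns (dirty_rows, changes); fewer than n_errors changes are made
    when there are not enough groups with two or more rows.
    """
    rng = random.Random(seed)
    dirty = [dict(r) for r in rows]  # copy, so the clean input stays intact

    groups = defaultdict(list)
    for i, row in enumerate(dirty):
        groups[row[lhs]].append(i)

    # A change is only detectable if another row shares the same lhs value.
    candidates = [idxs for idxs in groups.values() if len(idxs) >= 2]
    rng.shuffle(candidates)

    changes = []
    for idxs in candidates[:n_errors]:
        i = rng.choice(idxs)
        old = dirty[i][rhs]
        new = str(old) + "_err"  # hypothetical perturbation of the cell value
        dirty[i][rhs] = new
        changes.append((i, rhs, old, new))
    return dirty, changes


if __name__ == "__main__":
    clean = [
        {"zip": "10001", "city": "New York"},
        {"zip": "10001", "city": "New York"},
        {"zip": "60601", "city": "Chicago"},
        {"zip": "60601", "city": "Chicago"},
        {"zip": "94105", "city": "San Francisco"},
    ]
    dirty, changes = inject_fd_errors(clean, "zip", "city", n_errors=2)
    for change in changes:
        print(change)
    print(len(fd_violations(dirty, "zip", "city")), "violating groups")
```

Keeping the list of changes alongside the dirty table gives a ground truth against which a data-cleaning tool's repairs could be scored. With several overlapping constraints, a change made for one constraint can create or hide violations of another, which this single-constraint sketch deliberately avoids.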
