Journal: Vietnam Journal of Computer Science

Inferring the cause of errors for a scalable, accurate, and complete constraint-based data cleansing


Abstract

In real-world dirty data, errors are often not randomly distributed. Rather, they tend to occur only under certain conditions, such as when a transaction is handled by a certain operator or when the weather is rainy. Leveraging such common conditions, or "cause conditions", the proposed data-cleansing algorithm resolves multi-tuple conflicts quickly, achieves higher completeness, and remains accurate in realistic settings. We first present complexity analyses of the problem, pointing out two subproblems that are NP-complete. We then introduce, for each subproblem, heuristics that work in sub-polynomial time. We also raise the issue that some previous studies overlook the notion of repair-completeness, that is, leaving fewer unresolved conflicts in the resulting repairs. The proposed method can obtain a complete repair if we are allowed to preprocess the input set of constraints. The algorithms are tested with three sets of data and rules. The experiments show that, compared to state-of-the-art methods for conditional functional dependency (CFD)-based and functional dependency (FD)-based data cleansing, the proposed algorithm scales better with respect to the data size, is the only method that outputs complete repairs, and is more accurate, especially when the error distribution is skewed.
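The idea of a "cause condition" can be illustrated with a small sketch. This is not the algorithm proposed in the paper, only a toy illustration under stated assumptions: the table, the column names (`zip`, `city`, `operator`), the FD `zip -> city`, and the scoring rule (count how many conflicting groups become consistent once the tuples matching a condition are set aside) are all hypothetical.

```python
from collections import defaultdict


def conflict_groups(rows, lhs, rhs):
    """Group row indices by their lhs values and keep only the groups
    that violate the FD lhs -> rhs (more than one distinct rhs value)."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(i)
    return [idxs for idxs in groups.values()
            if len({rows[i][rhs] for i in idxs}) > 1]


def score_conditions(rows, groups, rhs, candidate_attrs):
    """For each (attribute, value) condition, count the conflicting groups
    that become consistent when rows matching the condition are set aside."""
    conditions = {(a, row[a]) for row in rows for a in candidate_attrs}
    scores = {}
    for attr, value in conditions:
        resolved = 0
        for idxs in groups:
            rest = [i for i in idxs if rows[i][attr] != value]
            if len({rows[i][rhs] for i in rest}) <= 1:
                resolved += 1
        scores[(attr, value)] = resolved
    return scores


# Hypothetical toy data: the rows entered by operator "bob" contain the errors.
rows = [
    {"zip": "10001", "city": "New York", "operator": "alice"},
    {"zip": "10001", "city": "New York", "operator": "carol"},
    {"zip": "10001", "city": "Newark", "operator": "bob"},
    {"zip": "60601", "city": "Chicago", "operator": "alice"},
    {"zip": "60601", "city": "Chicgo", "operator": "bob"},
    {"zip": "94105", "city": "San Francisco", "operator": "carol"},
]

groups = conflict_groups(rows, ["zip"], "city")
scores = score_conditions(rows, groups, "city", ["operator"])
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy table, the condition that accounts for the most conflicts points at the tuples most likely to be erroneous, which is the intuition behind exploiting a skewed error distribution rather than treating every conflicting tuple as equally suspect.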
