Journal: IEEE Transactions on Knowledge and Data Engineering

A Novel Cost-Based Model for Data Repairing



Abstract

Integrity-constraint-based data repairing is an iterative process with two parts: detecting and grouping errors that violate given integrity constraints (ICs), and modifying values inside each group so that the modified database satisfies those ICs. However, most existing automatic solutions handle error detection and grouping straightforwardly (e.g., finding violations of functional dependencies by string equality) and devote more attention to heuristics for modifying values within each group. In this paper, we propose a revised semantics of violations and data consistency w.r.t. a set of ICs. The revised semantics relies on string similarity, in contrast to traditional methods that detect errors syntactically using string equality. Along with the revised semantics, we also propose a new cost model that quantifies the cost of a data repair by considering distances between strings. We show that the revised semantics substantially improves error detection and grouping, which in turn improves both the precision and recall of the subsequent data repairing step. We prove that finding minimum-cost repairs in the new model is NP-hard, even for a single functional dependency (FD). We devise efficient algorithms to find approximate repairs. In addition, we develop indices and optimization techniques to improve efficiency. Experiments show that our approach significantly outperforms existing automatic repair algorithms in both precision and recall.
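To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It assumes one plausible reading of similarity-based violation detection: two tuples conflict on a functional dependency when their left-hand-side values are within an edit-distance threshold but their right-hand-side values differ. The function names, the threshold `tau`, and the sample rows are all hypothetical, and the repair cost shown is simply the total edit distance between original and repaired values.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def equality_violations(rows, lhs, rhs):
    """Traditional check for lhs -> rhs: equal lhs, different rhs."""
    return [(i, j)
            for (i, r), (j, s) in combinations(enumerate(rows), 2)
            if r[lhs] == s[lhs] and r[rhs] != s[rhs]]


def similarity_violations(rows, lhs, rhs, tau):
    """Assumed relaxed check: lhs values within tau edits, different rhs."""
    return [(i, j)
            for (i, r), (j, s) in combinations(enumerate(rows), 2)
            if edit_distance(r[lhs], s[lhs]) <= tau and r[rhs] != s[rhs]]


def repair_cost(original, repaired, attrs):
    """Assumed cost: sum of edit distances over all changed cells."""
    return sum(edit_distance(o[a], p[a])
               for o, p in zip(original, repaired)
               for a in attrs)


# Hypothetical data: row 1 has a typo in "city" that hides the conflict
# from an equality-based check on the FD city -> state.
rows = [
    {"city": "Boston", "state": "MA"},
    {"city": "Bostn", "state": "NY"},
    {"city": "Chicago", "state": "IL"},
]
repaired = [
    {"city": "Boston", "state": "MA"},
    {"city": "Boston", "state": "MA"},
    {"city": "Chicago", "state": "IL"},
]

print(equality_violations(rows, "city", "state"))
print(similarity_violations(rows, "city", "state", tau=1))
print(repair_cost(rows, repaired, ["city", "state"]))
```

In this toy example the equality-based check reports no violations because "Boston" and "Bostn" are different strings, while the similarity-based check pairs rows 0 and 1. The choice of threshold trades missed errors against false conflicts between genuinely distinct values.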

Bibliographic details

  • Source
  • Author affiliations

    Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China;

    Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar;

    Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China;

    Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China;

    Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China;

    Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China;

  • Indexing information
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification
  • Keywords

    Maintenance engineering; Urban areas; Semantics; Integrated circuits; Databases; Education; Fault tolerance;

