Venue: International conference on management of data

Interaction between Record Matching and Data Repairing

Abstract

Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, while repairing aims to make a database consistent by using constraints to fix errors in the data. Current data cleaning systems treat these as separate processes, based on heuristic solutions. This paper studies a new problem, namely, the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we propose a uniform framework that seamlessly unifies repairing and matching operations, to clean a database based on integrity constraints, matching rules and master data. We give a full treatment of fundamental problems associated with data cleaning via matching and repairing, including the static analyses of constraints and rules taken together, and the complexity, termination and determinism analyses of data cleaning. We show that these problems are hard, ranging from NP-complete or coNP-complete to PSPACE-complete. Nevertheless, we propose efficient algorithms to clean data via both matching and repairing. The algorithms find deterministic fixes and reliable fixes based on confidence and entropy analysis, respectively, which are more accurate than possible fixes generated by heuristics. Using real-life data, we experimentally verify that our techniques significantly improve on the accuracy of record matching and data repairing performed as separate processes.
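The claim that repairing can help identify matches, and vice versa, may be easier to see on a toy case. The sketch below is only an illustration under assumed inputs: the records, the field names, the constraint "area code 131 implies city Edi", and the matching rule are all hypothetical, and the code is not the paper's algorithm or framework.

```python
# Toy illustration only: every rule, field name, and record here is hypothetical.

def repair_city(record):
    """Constraint-style repair: area code 131 implies city 'Edi'."""
    fixed = dict(record)
    if fixed["area_code"] == "131" and fixed["city"] != "Edi":
        fixed["city"] = "Edi"
    return fixed


def matches(record, master):
    """Matching rule: equal last name, city, and phone mean the same person."""
    return all(record[k] == master[k] for k in ("last_name", "city", "phone"))


def enrich_from_master(record, master):
    """After a match, copy trusted master values into the record."""
    fixed = dict(record)
    for key in ("first_name", "street"):
        fixed[key] = master[key]
    return fixed


master = {
    "first_name": "Robert", "last_name": "Brady", "area_code": "131",
    "city": "Edi", "phone": "6884563", "street": "5 Wren St",
}
dirty = {
    "first_name": "Bob", "last_name": "Brady", "area_code": "131",
    "city": "Ldn", "phone": "6884563", "street": "",
}

# Matching alone fails because the city value is wrong.
matched_before = matches(dirty, master)

# Repairing the city first makes the matching rule applicable.
repaired = repair_city(dirty)
matched_after = matches(repaired, master)

# The match in turn enables further repairs drawn from master data.
cleaned = enrich_from_master(repaired, master) if matched_after else repaired
```

In this sketch the dirty record does not match the master record until the constraint has corrected its city, and only after the match can the name and street be filled in from master data, which is one way the two operations can feed each other.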
