A Novel Cost-Based Model for Data Repairing

Shuang Hao; Nan Tang; Guoliang Li; Jian He; Na Ta; Jianhua Feng

Abstract

Integrity-constraint-based data repairing is an iterative process consisting of two parts: detecting and grouping errors that violate given integrity constraints (ICs), and modifying values inside each group such that the modified database satisfies those ICs. However, most existing automatic solutions treat the detection and grouping of errors straightforwardly (e.g., finding violations of functional dependencies using string equality), while focusing on heuristics for modifying values within each group. In this paper, we propose a revised semantics of violations and data consistency w.r.t. a set of ICs. The revised semantics relies on string similarities, in contrast to traditional methods, which detect errors syntactically using string equality. Along with the revised semantics, we also propose a new cost model that quantifies the cost of a data repair by considering distances between strings. We show that the revised semantics significantly improves the detection and grouping of errors, which in turn improves both the precision and recall of the subsequent data repairing step. We prove that finding minimum-cost repairs in the new model is NP-hard, even for a single functional dependency (FD). We devise efficient algorithms to find approximate repairs. In addition, we develop indices and optimization techniques to improve efficiency. Experiments show that our approach significantly outperforms existing automatic repair algorithms in both precision and recall.
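
The contrast between equality-based and similarity-based violation detection can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the formal definitions, so the choice of Levenshtein edit distance, the threshold `tau`, the helper names (`edit_distance`, `violations`, `repair_cost`) and the toy `city -> state` FD are all assumptions made for illustration only.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def violations(rows, lhs, rhs, tau=0):
    """Return index pairs whose lhs values are within tau edits of each
    other but whose rhs values differ. tau=0 is plain string equality.
    (Assumed relaxation; the paper's definition may differ.)"""
    found = []
    for i, j in combinations(range(len(rows)), 2):
        close = edit_distance(rows[i][lhs], rows[j][lhs]) <= tau
        if close and rows[i][rhs] != rows[j][rhs]:
            found.append((i, j))
    return found


def repair_cost(before, after):
    """Sum of edit distances over all changed cells (assumed cost model)."""
    return sum(edit_distance(old[k], new[k])
               for old, new in zip(before, after)
               for k in old)


# Hypothetical table with the FD city -> state; row 1 has a typo in city.
rows = [
    {"city": "Boston", "state": "MA"},
    {"city": "Bostn", "state": "NY"},
    {"city": "Chicago", "state": "IL"},
]

strict = violations(rows, "city", "state", tau=0)    # equality misses the typo
relaxed = violations(rows, "city", "state", tau=1)   # similarity catches it

repaired = [dict(r) for r in rows]
repaired[1] = {"city": "Boston", "state": "MA"}
cost = repair_cost(rows, repaired)
```

Under string equality the misspelled row forms its own group, so no violation is reported; with a one-edit tolerance it is grouped with the correct row and the conflicting `state` values surface. The distance-based cost then prefers repairs that change few characters.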

Bibliographic details

Source
IEEE Transactions on Knowledge and Data Engineering, 2017, No. 4, pp. 727-742 (16 pages)
Authors
Shuang Hao; Nan Tang; Guoliang Li; Jian He; Na Ta; Jianhua Feng
Author affiliations

Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China;

Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar;

Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China;

Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China;

Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China;

Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China;

Original format: PDF
Language: English
Keywords
Maintenance engineering; Urban areas; Semantics; Integrated circuits; Databases; Education; Fault tolerance;
