Inferring the cause of errors for a scalable, accurate, and complete constraint-based data cleansing

Ayako Hoshino; Hiroki Nakayama; Chihiro Ito; Kyota Kanno; Kenshi Nishimura


Abstract

In real-world dirty data, errors are often not randomly distributed. Rather, they tend to occur only under certain conditions, such as when a transaction is handled by a certain operator or when the weather is rainy. Leveraging such common conditions, or “cause conditions”, the proposed data-cleansing algorithm resolves multi-tuple conflicts at high speed, achieves higher completeness, and runs with high accuracy in realistic settings. We first present complexity analyses of the problem, pointing out two subproblems that are NP-complete. We then introduce, for each subproblem, heuristics that work in sub-polynomial time. We also raise the issue that some previous studies overlook the notion of repair-completeness, that is, leaving fewer unresolved conflicts in the resulting repairs. The proposed method is capable of obtaining a complete repair if we are allowed to preprocess the input set of constraints. The algorithms are tested with three sets of data and rules. The experiments show that, compared to the state-of-the-art methods for conditional functional dependency (CFD)-based and functional dependency (FD)-based data cleansing, the proposed algorithm scales better with respect to the data size, is the only method that outputs complete repairs, and is more accurate, especially when the error distribution is skewed.
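To make the idea of a "cause condition" concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a toy table, a single functional dependency `zip -> city`, and a hypothetical scoring rule (the fraction of tuples matching an attribute value that are involved in a conflict). All names here (`fd_violations`, `rank_cause_conditions`, the `operator` column) are illustrative assumptions.

```python
from collections import Counter, defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the indices of rows involved in a violation of the FD lhs -> rhs."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[lhs]].append(i)
    violating = set()
    for idxs in groups.values():
        # A group conflicts when one lhs value maps to several rhs values.
        if len({rows[i][rhs] for i in idxs}) > 1:
            violating.update(idxs)
    return violating


def rank_cause_conditions(rows, violating, candidate_attrs):
    """Rank (attribute, value) pairs by the share of matching rows in conflict.

    This scoring rule is an assumption made for illustration only.
    """
    total = Counter()
    bad = Counter()
    for i, row in enumerate(rows):
        for attr in candidate_attrs:
            key = (attr, row[attr])
            total[key] += 1
            if i in violating:
                bad[key] += 1
    scores = {key: bad[key] / total[key] for key in total}
    # Highest score first; ties broken by the (attribute, value) pair.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data: the rows handled by "carol" carry the wrong city.
rows = [
    {"zip": "10001", "city": "NYC", "operator": "alice"},
    {"zip": "10001", "city": "NYC", "operator": "bob"},
    {"zip": "10001", "city": "Boston", "operator": "carol"},
    {"zip": "02108", "city": "Boston", "operator": "alice"},
    {"zip": "02108", "city": "NYC", "operator": "carol"},
    {"zip": "02108", "city": "Boston", "operator": "bob"},
    {"zip": "60601", "city": "Chicago", "operator": "alice"},
    {"zip": "60601", "city": "Chicago", "operator": "bob"},
]

violating = fd_violations(rows, "zip", "city")
ranking = rank_cause_conditions(rows, violating, ["operator"])
```

In this toy table every tuple handled by `carol` takes part in a conflict, so that condition ranks first. A repair step could then prefer to change the tuples matching the top-ranked condition, which is the intuition the abstract describes for skewed error distributions.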


Bibliographic information

Source: Vietnam Journal of Computer Science, 2017, Issue 1, 10 pages
Authors: Ayako Hoshino; Hiroki Nakayama; Chihiro Ito; Kyota Kanno; Kenshi Nishimura
Original format: PDF
Chinese Library Classification: Computing technology, computer technology
Added to database: 2022-08-18 15:21:38

