Journal: Symmetry

Inconsistent Data Cleaning Based on the Maximum Dependency Set and Attribute Correlation


Abstract

In banks, governments, and Internet companies, inconsistent data often arise when information systems collect, process, and update data, owing to human or equipment causes. Inconsistent data make it impossible to obtain correct information and reduce the availability of the data. Such problems can be fatal for data-intensive enterprises and can cause huge economic losses. Moreover, cleaning inconsistent data in databases is very difficult, especially for data containing conditional functional dependencies with built-in predicates (CFDPs), because such data tend to have more candidate repair values. To address incomplete detection and difficult repair of inconsistent data containing CFDPs, we propose a dependency lifting algorithm (DLA) based on the maximum dependency set (MDS) and a repair algorithm (C-Repair) that integrates minimum cost and attribute correlation. In detection, we find recessive dependencies in the original dependency set to obtain the MDS, and we improve the original algorithm with dynamic domain adjustment, which extends its applicability to continuous attributes and improves detection accuracy. In repair, we first set up a priority queue (PQ) of elements to be repaired, based on the minimum-cost idea, to select a candidate element. We then treat the corresponding conflict-free instance (I_nv) as the training set to learn the correlation among attributes, and compute the weighted distance (WDis) between the tuple of the candidate element and the other tuples in I_nv according to that correlation. Lastly, we perform the repair based on the WDis and re-compute the PQ after each repair round to improve efficiency, while using a label, flag, to mark repaired elements and ensure convergence.
In a comparative experiment, we compare the DLA with the CFDPs-based algorithm, and C-Repair with a cost-based, interpolation-based algorithm, on a simulated instance and a real instance. The experimental results show that the DLA and C-Repair algorithms achieve better detection and repair ability at a higher time cost.
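The repair loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' C-Repair implementation: the abstract does not specify the cost function, how attribute correlations are learned, or the exact form of WDis, so all of these are assumptions here. The names `weighted_distance`, `repair`, `cost`, and `weights` are hypothetical, the correlation weights are passed in rather than learned, and the distance is a simple weighted mismatch count.

```python
import heapq


def weighted_distance(t, u, attr_weights):
    """Assumed WDis: sum of the weights of the attributes on which t and u differ."""
    return sum(w for b, w in attr_weights.items() if t[b] != u[b])


def repair(dirty_cells, tuples, clean, weights, cost):
    """Repair flagged cells one by one, cheapest first.

    dirty_cells: list of (tuple_index, attribute) pairs to repair.
    tuples:      list of dicts, the instance being repaired (modified in place).
    clean:       list of dicts, standing in for the conflict-free instance.
    weights:     weights[a][b] is an assumed correlation of attribute b with a.
    cost:        hypothetical function (tuple_index, attribute) -> repair cost.
    """
    flagged = set()  # plays the role of the "flag" label: cells already repaired
    queue = [(cost(i, a), i, a) for i, a in dirty_cells]
    heapq.heapify(queue)
    while queue:
        _, i, a = heapq.heappop(queue)
        if (i, a) in flagged:
            continue
        # Take the value of the closest conflict-free tuple under the weighted distance.
        nearest = min(clean, key=lambda u: weighted_distance(tuples[i], u, weights[a]))
        tuples[i][a] = nearest[a]
        flagged.add((i, a))
        # Re-compute the priority queue after each repair round.
        queue = [(cost(j, b), j, b) for _, j, b in queue if (j, b) not in flagged]
        heapq.heapify(queue)
    return tuples


clean = [
    {"city": "NYC", "zip": "10001", "state": "NY"},
    {"city": "LA", "zip": "90001", "state": "CA"},
]
dirty = [{"city": "NYC", "zip": "10001", "state": "CA"}]
weights = {"state": {"city": 0.6, "zip": 0.4}}  # made-up correlation weights
repaired = repair([(0, "state")], dirty, clean, weights, cost=lambda i, a: 1.0)
print(repaired[0]["state"])
```

In this toy instance the dirty tuple matches the first conflict-free tuple on both `city` and `zip`, so its `state` value is taken from that tuple. Marking repaired cells in `flagged` ensures each cell is repaired at most once, which is what guarantees that the loop terminates.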
