In banks, governments, and Internet companies, inconsistent data often arise while information systems collect, process, and update data, owing to human error or equipment faults. Inconsistent data make it impossible to obtain correct information and reduce the usability of the data. Such problems can be fatal in data-intensive enterprises and cause huge economic losses. Moreover, inconsistent data in databases are very difficult to clean, especially data governed by conditional functional dependencies with built-in predicates (CFDPs), because such data tend to have more candidate repair values. To address incomplete detection and difficult repair of inconsistent data containing CFDPs, we propose a dependency lifting algorithm (DLA) based on the maximum dependency set (MDS) for detection, and a repair algorithm (C-Repair) that integrates minimum cost with attribute correlation. In detection, we derive implicit dependencies from the original dependency set to obtain the MDS, and we improve the original algorithm with dynamic domain adjustment, which extends its applicability to continuous attributes and improves detection accuracy. In repair, we first build a priority queue (PQ) of the elements to be repaired, ordered by the minimum-cost idea, and select a candidate element from it. We then treat the corresponding conflict-free instance ($I_{nv}$) as a training set to learn the correlation among attributes, and use that correlation to compute the weighted distance (WDis) between the tuple of the candidate element and the other tuples in $I_{nv}$. Lastly, we repair the element based on the WDis and re-compute the PQ after each repair round to improve efficiency, while a label, flag, marks the repaired elements to ensure convergence.
In a comparative experiment, we compare the DLA with the CFDPs-based algorithm, and C-Repair with a cost-based, interpolation-based algorithm, on a simulated instance and a real instance. The experimental results show that the DLA and C-Repair achieve better detection and repair ability, at a higher time cost.
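The weighted-distance step of the repair phase can be illustrated with a minimal sketch. This is not the authors' C-Repair: the function names `weighted_distance` and `repair_value`, the use of an absolute-difference distance, and the assumption that the correlation weights are already given as a dictionary are all hypothetical choices made for illustration. The abstract does not specify how the correlation is learned or how WDis is defined.

```python
from typing import Dict, List

Row = Dict[str, float]  # one tuple of the instance, numeric attributes only (assumption)


def weighted_distance(t: Row, u: Row, weights: Dict[str, float], skip: str) -> float:
    """Weighted absolute-difference distance over all attributes except `skip`.

    `weights` stands in for the learned attribute correlation (assumed given).
    """
    return sum(w * abs(t[a] - u[a]) for a, w in weights.items() if a != skip)


def repair_value(dirty: Row, attr: str, clean_rows: List[Row],
                 weights: Dict[str, float]) -> float:
    """Return the value of `attr` from the conflict-free tuple closest to `dirty`."""
    nearest = min(clean_rows,
                  key=lambda u: weighted_distance(dirty, u, weights, attr))
    return nearest[attr]
```

Here `clean_rows` plays the role of the conflict-free instance, and `attr` is the attribute of the candidate element taken from the priority queue. Excluding `attr` from the distance keeps the erroneous value from influencing which neighbor is chosen.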