Inconsistent Data Cleaning Based on the Maximum Dependency Set and Attribute Correlation

Pei Li; Chaofan Dai; Wenqian Wang

Abstract

In banks, governments, and Internet companies, inconsistent data often arises when information systems collect, process, and update data, owing to human error or equipment faults. Inconsistent data makes it impossible to obtain correct information from the data and reduces its availability. Such problems can be fatal in data-intensive enterprises and can cause huge economic losses. Moreover, cleaning inconsistent data in databases is very difficult, especially for data subject to conditional functional dependencies with built-in predicates (CFDPs), because such data tends to have more candidate repair values. To address incomplete detection and difficult repair of inconsistent data under CFDPs, we propose a dependency lifting algorithm (DLA) based on the maximum dependency set (MDS) for detection, and a repair algorithm (C-Repair) that integrates minimum cost and attribute correlation. In detection, we find recessive dependencies in the original dependency set to obtain the MDS and improve the original algorithm by dynamic domain adjustment, which extends its applicability to continuous attributes and improves detection accuracy. In repair, we first set up a priority queue (PQ) of elements to be repaired, based on the minimum-cost idea, to select a candidate element; then we treat the corresponding conflict-free instance (I_nv) as the training set to learn the correlation among attributes, and compute the weighted distance (WDis) between the tuple of the candidate element and the other tuples in I_nv according to that correlation; lastly, we perform the repair based on WDis, re-compute the PQ after each repair round to improve efficiency, and use a label, flag, to mark repaired elements so that the process converges.

In a comparative experiment, we compare the DLA with the CFDPs-based algorithm, and C-Repair with cost-based and interpolation-based algorithms, on a simulated instance and a real instance. The experimental results show that the DLA and C-Repair algorithms have better detection and repair ability, at a higher time cost.
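The repair step described in the abstract can be illustrated with a small Python sketch. This is not the authors' C-Repair implementation: the function names (`weighted_distance`, `repair`), the cell-cost tuples, and the fixed correlation weights in `weights_for` are all assumptions made for illustration. In particular, the sketch takes the weights as given instead of learning them from the conflict-free instance, uses a simple mismatch count as the distance, and does not re-compute the queue after each round.

```python
import heapq


def weighted_distance(t, u, weights):
    """Sum the (assumed) correlation weights of attributes where t and u differ."""
    return sum(w for attr, w in weights.items() if t[attr] != u[attr])


def repair(dirty_cells, rows, clean_rows, weights_for):
    """Repair flagged cells in `rows` using the nearest conflict-free row.

    dirty_cells: list of (cost, row_index, attribute), assumed to come from detection.
    rows:        list of dict rows; repaired in place.
    clean_rows:  conflict-free rows used as the reference set.
    weights_for: hypothetical map from a target attribute to
                 {other_attribute: correlation_weight}.
    """
    queue = list(dirty_cells)
    heapq.heapify(queue)          # lowest-cost cell is repaired first
    repaired = set()              # plays the role of the "flag" label
    while queue:
        _cost, i, attr = heapq.heappop(queue)
        if (i, attr) in repaired:
            continue              # each cell is repaired at most once
        weights = weights_for[attr]
        nearest = min(
            clean_rows,
            key=lambda u: weighted_distance(rows[i], u, weights),
        )
        rows[i][attr] = nearest[attr]
        repaired.add((i, attr))
    return rows


# Illustrative data only; not taken from the paper's experiments.
clean_rows = [
    {"city": "Paris", "zip": "75001", "country": "FR"},
    {"city": "Berlin", "zip": "10115", "country": "DE"},
]
rows = [{"city": "Paris", "zip": "75001", "country": "DE"}]
weights_for = {"country": {"city": 0.6, "zip": 0.4}}
result = repair([(1.0, 0, "country")], rows, clean_rows, weights_for)
```

Marking repaired cells in a set means no cell is visited twice, so the loop terminates once the queue is empty; this mirrors the role the abstract assigns to the flag label, under the simplifications stated above.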


Bibliographic details

Source: Symmetry, 2018, Issue 10, 24 pages
Authors: Pei Li; Chaofan Dai; Wenqian Wang
Original format: PDF
CLC classification: Physiology
Keywords: inconsistent data; CFDs with built-in predicates; maximum dependency set; minimum cost; attribute correlation; machine learning
