SAI Computing Conference

A density-based data cleaning approach for deduplication with data consistency and accuracy



Abstract

Data cleaning is a critical part of the data transformation stage in data warehousing, where data extracted from relational databases are usually unclean. Unclean data may affect critical tasks in organizations, such as data analysis and decision making. Current data cleaning techniques generally deal with only one or two quality aspects. They also assume that master data are available, or that users take part in the cleaning, for example by manually assigning confidence scores that represent the correctness of data values. In this paper, we present a uniform framework and algorithms that integrate data deduplication with the repair of inconsistent data and the discovery of accurate values in data. We use the density information embedded in the data to fix errors, where tuples that are close to each other are packed together. We present a weight model that assigns confidence scores based on the density of the data. The assignments are automated and no user is involved in the process. We consider inconsistency in terms of violations of a set of functional dependencies (FDs), as these violations are common in practice. We present a cost model for data repairing that is based on the weight model. We experimentally verify the quality and scalability of our algorithms on synthetic and real datasets.
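To make the idea of FD violations and automatically assigned confidence scores more concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the weight model or the cost model, so this sketch assumes a much simpler stand-in in which the "density" of a value is its relative frequency among tuples that share the same FD left-hand side. The function names (`fd_violations`, `frequency_weights`, `repair`) and the sample `zip -> city` dependency are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of row indices that violate the FD lhs -> rhs.

    Rows are grouped by their LHS values; a group is a violation when it
    contains more than one distinct RHS value.
    """
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in lhs)
        groups[key].append(i)
    return {
        key: idxs
        for key, idxs in groups.items()
        if len({rows[i][rhs] for i in idxs}) > 1
    }


def frequency_weights(rows, idxs, rhs):
    """Assumed stand-in for a density-based weight: the relative frequency
    of each RHS value inside one violating group."""
    counts = Counter(rows[i][rhs] for i in idxs)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def repair(rows, lhs, rhs):
    """Return a copy of rows where each violating group takes the RHS value
    with the highest weight (ties go to the value seen first)."""
    repaired = [dict(r) for r in rows]
    for idxs in fd_violations(rows, lhs, rhs).values():
        weights = frequency_weights(rows, idxs, rhs)
        best = max(weights, key=weights.get)
        for i in idxs:
            repaired[i][rhs] = best
    return repaired


rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "60601", "city": "Chicago"},
]

violations = fd_violations(rows, ["zip"], "city")
cleaned = repair(rows, ["zip"], "city")
```

In this toy input, the three tuples with zip `10001` disagree on `city`, so they form one violating group, and the majority value overwrites the minority one. A real density-based approach would presumably measure closeness between whole tuples rather than count exact matches, which this sketch does not attempt.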
