Domain-independent de-duplication in data warehouse cleaning.

Abstract

Many organizations collect large amounts of data to support their business and decision-making processes. The data collected originate from a variety of sources that may have inherent data quality problems, and these problems become more pronounced when heterogeneous data sources are integrated to build data warehouses. Data warehouses, which integrate huge amounts of data from a number of heterogeneous sources, are used to support decision-making and on-line analytical processing. The integrated databases inherit the data quality problems that were present in the source databases, and also acquire data quality problems arising from the integration process itself. The data in the integrated systems (especially data warehouses) therefore need to be cleaned for reliable decision-support querying.

A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying "equivalent" records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge; a few others propose solutions that are partially domain-independent. This thesis identifies two levels of domain-independence in de-duplication, namely: domain-independence at the attribute level, and domain-independence at the record level. The thesis then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level. The thesis also proposes a technique for field weighting by data profiling, which, when used with the positional algorithm, achieves domain-independent de-duplication at the record level. Experiments show that the positional algorithm achieves more accurate de-duplication than the existing algorithms. Experiments also show that the data profiling technique for field weighting effectively assigns field weights for de-duplication purposes.
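The abstract does not spell out the positional algorithm or the profiling procedure, so the following is only an illustrative sketch of the general idea: a position-aware, domain-independent field similarity, per-field weights derived by profiling the data (here, assuming more-distinctive fields deserve higher weight), and a weighted record-level score. All function names (`positional_similarity`, `profile_weights`, `record_similarity`) and the specific formulas are assumptions, not the thesis's actual method.

```python
def positional_similarity(a: str, b: str) -> float:
    """Position-aware character match ratio between two field values.
    An assumed stand-in for an attribute-level positional comparison;
    it needs no domain knowledge about what the field contains."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 1.0
    # Count characters that agree at the same position.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def profile_weights(records):
    """Derive field weights by profiling the data: fields with more
    distinct values are more discriminating, so they receive higher
    weight. Weights are normalized to sum to 1."""
    n_fields = len(records[0])
    distinct = [len({r[i] for r in records}) for i in range(n_fields)]
    total = sum(distinct)
    return [d / total for d in distinct]

def record_similarity(r1, r2, weights):
    """Record-level score: weighted sum of field similarities."""
    return sum(w * positional_similarity(f1, f2)
               for w, f1, f2 in zip(weights, r1, r2))

# Toy dataset: the first two records differ only by a name typo.
records = [
    ("John Smith", "Springfield", "IL"),
    ("John Smyth", "Springfield", "IL"),
    ("Mary Jones", "Shelbyville", "IL"),
]
weights = profile_weights(records)  # name field gets the highest weight
dup_score = record_similarity(records[0], records[1], weights)
non_dup_score = record_similarity(records[0], records[2], weights)
```

In this sketch, records whose score exceeds a chosen threshold would be flagged as likely duplicates; the near-identical pair scores far higher than the unrelated pair, while the constant state field contributes the least weight.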