
Domain-independent de-duplication in data warehouse cleaning.



Abstract

Many organizations collect large amounts of data to support their business and decision-making processes. The data collected originate from a variety of sources that may have inherent data quality problems. These problems become more pronounced when heterogeneous data sources are integrated to build data warehouses. Data warehouses, which integrate huge amounts of data from a number of heterogeneous data sources, are used to support decision-making and on-line analytical processing. The integrated databases inherit the data quality problems that were present in the source databases, and also have data quality problems arising from the integration process. The data in the integrated systems (especially data warehouses) need to be cleaned for reliable decision support querying. A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying "equivalent" records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge. A few others propose solutions that are partially domain-independent. This thesis identifies two levels of domain-independence in de-duplication, namely domain-independence at the attribute level and domain-independence at the record level. The thesis then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level. The thesis also proposes a technique for field weighting by data profiling, which, when used with the positional algorithm, achieves domain-independent de-duplication at the record level. Experiments show that the positional algorithm achieves more accurate de-duplication than the existing algorithms. Experiments also show that the data profiling technique for field weighting effectively assigns field weights for de-duplication purposes. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2002 .U34.
Source: Masters Abstracts International, Volume: 41-04, page: 1123. Adviser: Christie I. Ezeife. Thesis (M.Sc.)--University of Windsor (Canada), 2002.
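
The two levels named in the abstract can be illustrated with a minimal sketch. This is not the thesis's positional algorithm or its profiling technique, neither of which the abstract specifies. As assumptions, the sketch substitutes Python's generic `difflib.SequenceMatcher` for the attribute-level matcher and uses a hypothetical profiling heuristic that weights each field by its share of distinct non-empty values. All function names and the 0.85 threshold are illustrative.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Attribute-level similarity in [0, 1] that uses no domain knowledge.

    Assumption: SequenceMatcher stands in for the thesis's positional
    algorithm, which the abstract does not describe.
    """
    a, b = a.strip().lower(), b.strip().lower()
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def profile_weights(records: list[dict[str, str]]) -> dict[str, float]:
    """Hypothetical profiling heuristic (an assumption, not the thesis's):
    weight each field by its share of distinct non-empty values, then
    normalise so that the weights sum to 1."""
    raw = {}
    for field in records[0]:
        values = {r[field].strip().lower() for r in records if r[field].strip()}
        raw[field] = len(values) / len(records)
    total = sum(raw.values()) or 1.0
    return {field: w / total for field, w in raw.items()}


def record_similarity(r1: dict[str, str], r2: dict[str, str],
                      weights: dict[str, float]) -> float:
    """Record-level similarity: weighted sum of the field similarities."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())


def find_duplicates(records: list[dict[str, str]],
                    threshold: float = 0.85) -> list[tuple[int, int]]:
    """Return index pairs whose record similarity reaches the threshold.

    Compares every pair, so this is only suitable for small inputs.
    """
    weights = profile_weights(records)
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((i, j))
    return pairs


records = [
    {"name": "John A. Smith", "city": "Windsor", "country": "Canada"},
    {"name": "John Smith", "city": "Windsor", "country": "Canada"},
    {"name": "Mary Jones", "city": "Toronto", "country": "Canada"},
]
print(find_duplicates(records))  # [(0, 1)]
```

In this toy data the `country` field has a single value, so it receives the lowest weight, while `name` is fully distinct and receives the highest. That is the sense in which weights come from the data rather than from domain knowledge.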
