Domain-independent de-duplication in data warehouse cleaning.

Abstract

Many organizations collect large amounts of data to support their business and decision-making processes. The data collected originate from a variety of sources that may have inherent data quality problems, and these problems become more pronounced when heterogeneous data sources are integrated to build data warehouses. Data warehouses, which integrate huge amounts of data from a number of heterogeneous sources, are used to support decision-making and on-line analytical processing. The integrated databases inherit the data quality problems that were present in the source databases, and also acquire data quality problems arising from the integration process itself. The data in the integrated systems (especially data warehouses) therefore need to be cleaned for reliable decision-support querying.

A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying "equivalent" records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge; a few others propose solutions that are partially domain-independent. This thesis identifies two levels of domain-independence in de-duplication, namely: domain-independence at the attribute level, and domain-independence at the record level. The thesis then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level. The thesis also proposes a technique for field weighting by data profiling, which, when used with the positional algorithm, achieves domain-independent de-duplication at the record level. Experiments show that the positional algorithm achieves more accurate de-duplication than the existing algorithms. Experiments also show that the data profiling technique for field weighting effectively assigns field weights for de-duplication purposes.
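The abstract does not spell out the positional algorithm or the profiling procedure, so the following is only an illustrative sketch of the general idea: a position-aware, domain-independent field similarity, per-field weights derived by profiling the data (here, assuming more-distinctive fields deserve higher weight), and a weighted record-level score. All function names (`positional_similarity`, `profile_weights`, `record_similarity`) and the specific formulas are assumptions, not the thesis's actual method.

```python
def positional_similarity(a: str, b: str) -> float:
    """Position-aware character match ratio between two field values.
    An assumed stand-in for an attribute-level positional comparison;
    it needs no domain knowledge about what the field contains."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 1.0
    # Count characters that agree at the same position.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def profile_weights(records):
    """Derive field weights by profiling the data: fields with more
    distinct values are more discriminating, so they receive higher
    weight. Weights are normalized to sum to 1."""
    n_fields = len(records[0])
    distinct = [len({r[i] for r in records}) for i in range(n_fields)]
    total = sum(distinct)
    return [d / total for d in distinct]

def record_similarity(r1, r2, weights):
    """Record-level score: weighted sum of field similarities."""
    return sum(w * positional_similarity(f1, f2)
               for w, f1, f2 in zip(weights, r1, r2))

# Toy dataset: the first two records differ only by a name typo.
records = [
    ("John Smith", "Springfield", "IL"),
    ("John Smyth", "Springfield", "IL"),
    ("Mary Jones", "Shelbyville", "IL"),
]
weights = profile_weights(records)  # name field gets the highest weight
dup_score = record_similarity(records[0], records[1], weights)
non_dup_score = record_similarity(records[0], records[2], weights)
```

In this sketch, records whose score exceeds a chosen threshold would be flagged as likely duplicates; the near-identical pair scores far higher than the unrelated pair, while the constant state field contributes the least weight.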