Efficient Algorithms for Grouping Data to Improve Data Quality


Abstract

Improving and maintaining data quality has become a critical issue for many companies and organizations, because poor data degrades organizational performance, whereas quality data results in cost savings and customer satisfaction. Activities such as identifying and removing "duplicate" records within a single database, and correlating records from different databases that identify the same real-world "entity", are used routinely to improve data quality. Because data sources contain several hundred million to several billion records and continue to grow, efficient techniques and algorithms are needed. One approach to speeding up the processing is a two-step process: potential candidate records are grouped together in step one, and each group is further processed and analyzed in step two. The record grouping problem is a formal formulation of what needs to be done in step one. This paper introduces a record grouping problem called the transitive closure problem and proposes algorithms to solve it. The proposed algorithms have been implemented efficiently in several ways, and the paper reports on an empirical study of these implementations.
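
To make the two-step idea concrete, here is a minimal sketch, not taken from the paper, of forming step-one groups as the transitive closure of a pairwise "possible match" relation using a union-find (disjoint-set) structure. The toy records, the blocking key, and the match predicate are illustrative assumptions, not the paper's algorithm or data.

```python
# Step one of the two-step process: group candidate records so that records
# connected (directly or transitively) by a "possible match" relation end up
# in the same group.  Step two would then analyze each group in detail.

from collections import defaultdict
from itertools import combinations


class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self, n: int):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def group_records(records, blocking_key, possible_match):
    """Group records by the transitive closure of possible_match.
    Pairs are only compared inside blocks (records sharing a blocking key)
    to avoid a full quadratic scan over the whole data source."""
    uf = UnionFind(len(records))

    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[blocking_key(rec)].append(i)

    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if possible_match(records[i], records[j]):
                uf.union(i, j)

    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[uf.find(i)].append(rec)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical toy data and predicates, for illustration only.
    people = ["john smith", "john smyth", "j smyth", "mary jones"]
    clusters = group_records(
        people,
        blocking_key=lambda r: r.split()[-1][0],      # crude block: last name's initial
        possible_match=lambda a, b: bool(set(a.split()) & set(b.split())),  # share a token
    )
    for c in clusters:
        print(c)  # each cluster is handed to step two for detailed analysis
```

In this toy run, "john smith" and "j smyth" land in the same group only through their shared link to "john smyth"; that transitive-closure behavior is what makes step-one grouping more than simple pairwise matching.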