
Efficient Algorithms for Grouping Data to Improve Data Quality




Improving and maintaining data quality has become a critical issue for many companies and organizations, because poor data degrades organizational performance, whereas high-quality data results in cost savings and customer satisfaction. Two activities are used routinely to improve data quality: identifying and removing "duplicate" records from a single database, and correlating records from different databases that identify the same real-world "entity". Because the data sources are large, ranging from several hundred million to several billion records, and are continuously growing, efficient techniques and algorithms are needed. One approach to speeding up the processing is a two-step process: potential candidate records are grouped together in step one, and each group is further processed and analyzed in step two. The record grouping problem is a formal formulation of what needs to be done in step one. This paper introduces a record grouping problem called the transitive closure problem and proposes algorithms to solve it. The proposed algorithms have been implemented efficiently in several ways, and the paper reports on an empirical study of these implementations.
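
The abstract does not spell out the paper's algorithms, so the following is only an illustrative sketch of what step-one grouping by transitive closure could look like, not the authors' method. It assumes the input is a list of record identifiers plus a list of pairs already judged to be potential matches, and it uses a standard union-find (disjoint-set) structure so that records linked directly or indirectly end up in the same group. The names `group_records`, `record_ids`, and `matched_pairs` are hypothetical.

```python
from collections import defaultdict


def group_records(record_ids, matched_pairs):
    """Group records so that any two records connected by a chain of
    matched pairs land in the same group (transitive closure).

    Assumption: matched_pairs only mentions ids present in record_ids.
    """
    # Each record starts as the representative of its own group.
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the representative, halving the path
        # along the way to keep later lookups short.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merge the two groups by linking one representative to the other.
            parent[root_b] = root_a

    # Collect the members of each group under its representative.
    groups = defaultdict(list)
    for r in record_ids:
        groups[find(r)].append(r)

    # Sort for a deterministic result.
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    ids = [1, 2, 3, 4, 5]
    pairs = [(1, 2), (2, 3)]  # 1~2 and 2~3 imply 1~3 by transitivity
    print(group_records(ids, pairs))
```

In this sketch, records 1 and 3 are never compared directly, yet they share a group because both are linked to record 2. Each resulting group would then be handed to the more expensive step-two analysis.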


