Journal: IEEE Transactions on Knowledge and Data Engineering

A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication


Abstract

Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Matched data are becoming increasingly important in many application areas, because they can contain information that is otherwise unavailable or too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process has become one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They aim to reduce the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of 12 variations of 6 indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
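
The abstract does not name the six indexing techniques, so the following is only an illustrative sketch of the general idea it describes: cutting down the number of record pairs before any detailed comparison. It uses traditional blocking, a widely used approach in which only records sharing a blocking key are paired. The record fields (`id`, `surname`, `postcode`), the key definition, and the sample data are assumptions made up for this example and do not come from the paper.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first letter of the surname plus the postcode."""
    return (record["surname"][:1].lower(), record["postcode"])


def candidate_pairs(records):
    """Return the set of record-id pairs whose records share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample data for a single-database deduplication run.
records = [
    {"id": 1, "surname": "Smith", "postcode": "2600"},
    {"id": 2, "surname": "Smyth", "postcode": "2600"},
    {"id": 3, "surname": "Miller", "postcode": "2600"},
    {"id": 4, "surname": "Millar", "postcode": "2612"},
    {"id": 5, "surname": "smith", "postcode": "2600"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(sorted(pairs))
print(f"{len(pairs)} of {all_pairs} possible pairs kept for comparison")
```

In this toy data, records 3 and 4 ("Miller" and "Millar") land in different blocks because their postcodes differ, so that pair is never compared. This is the trade-off the abstract points to: fewer comparisons on one side, matching quality on the other.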
