首页> 外文期刊>ACM transactions on knowledge discovery from data >Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters
【24h】

Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters

机译:使用后缀数组和Bloom过滤器的稳健记录链接阻止

获取原文
获取原文并翻译 | 示例

摘要

Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, quadratic scalability for the brute force approach of comparing all possible pairs of records necessitates the design of appropriate indexing or blocking techniques. The aim of these techniques is to cheaply remove candidate record pairs that are unlikely to match. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in a much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlight the importance of using efficient indexing and blocking in real-world applications where datasets contain millions of records. We extend our disk-based methods with the capability to utilise main memory based storage to construct Bloom filters, which we have found to cause significant speedup by reducing the number of costly database queries by up to 70% in real data. We give practical implementation details and show how Bloom filters can be easily applied to Suffix Array based indexing.
机译:记录链接是一项重要的数据集成任务,在大型和多种数据库中具有许多实际用途,用于匹配,合并和重复删除。但是,用于比较所有可能的记录对的蛮力方法的二次可伸缩性需要设计适当的索引或阻止技术。这些技术的目的是廉价地删除不太可能匹配的候选记录对。我们设计和评估基于后缀数组的高效且高度可扩展的阻塞方法。我们的后缀分组技术利用索引使用的顺序以少量的额外成本合并相似的块,从而在保持基本后缀数组方法的高可伸缩性的同时,提供了更高的准确性。使用滑动窗口技术可以有效地对相似的后缀进行分组。我们对我们的方法进行了深入的分析,并显示了使用真实和合成数据的实验结果,这突显了在数据集包含数百万条记录的实际应用中使用有效索引和阻塞的重要性。我们扩展了基于磁盘的方法,使其能够利用基于主内存的存储来构造Bloom筛选器,我们发现通过将实际数据中的昂贵数据库查询数量减少多达70%,可以显着提高速度。我们提供了实用的实现细节,并展示了Bloom过滤器如何轻松地应用于基于后缀数组的索引。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号