Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters

TIMOTHY DE VRIES; HUI KE; SANJAY CHAWLA; PETER CHRISTEN

首页> 外文期刊>ACM transactions on knowledge discovery from data >Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters

【24h】

Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters

机译：使用后缀数组和Bloom过滤器的稳健记录链接阻止

获取原文

获取原文并翻译 | 示例

开具论文收录证明 >>

页面导航

摘要
著录项
引文网络
相似文献
相关主题

摘要

Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, quadratic scalability for the brute force approach of comparing all possible pairs of records necessitates the design of appropriate indexing or blocking techniques. The aim of these techniques is to cheaply remove candidate record pairs that are unlikely to match. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in a much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlight the importance of using efficient indexing and blocking in real-world applications where datasets contain millions of records. We extend our disk-based methods with the capability to utilise main memory based storage to construct Bloom filters, which we have found to cause significant speedup by reducing the number of costly database queries by up to 70% in real data. We give practical implementation details and show how Bloom filters can be easily applied to Suffix Array based indexing.

机译：记录链接是一项重要的数据集成任务，在大型和多种数据库中具有许多实际用途，用于匹配，合并和重复删除。但是，用于比较所有可能的记录对的蛮力方法的二次可伸缩性需要设计适当的索引或阻止技术。这些技术的目的是廉价地删除不太可能匹配的候选记录对。我们设计和评估基于后缀数组的高效且高度可扩展的阻塞方法。我们的后缀分组技术利用索引使用的顺序以少量的额外成本合并相似的块，从而在保持基本后缀数组方法的高可伸缩性的同时，提供了更高的准确性。使用滑动窗口技术可以有效地对相似的后缀进行分组。我们对我们的方法进行了深入的分析，并显示了使用真实和合成数据的实验结果，这突显了在数据集包含数百万条记录的实际应用中使用有效索引和阻塞的重要性。我们扩展了基于磁盘的方法，使其能够利用基于主内存的存储来构造Bloom筛选器，我们发现通过将实际数据中的昂贵数据库查询数量减少多达70％，可以显着提高速度。我们提供了实用的实现细节，并展示了Bloom过滤器如何轻松地应用于基于后缀数组的索引。

著录项

来源
《ACM transactions on knowledge discovery from data》 |2011年第2期|p.9.1-9.27|共27页
作者
TIMOTHY DE VRIES; HUI KE; SANJAY CHAWLA; PETER CHRISTEN;
展开▼
作者单位

School of Information Technologies, University of Sydney, Sydney NSW 2006, Australia;

School of Information Technologies, University of Sydney, Sydney NSW 2006, Australia;

School of Information Technologies, University of Sydney, Sydney NSW 2006, Australia;

College of Engineering and Computer Science, Australian National University (ANU), Canberra ACT 0200, Australia;

展开▼
收录信息
原文格式 PDF
正文语种 eng
中图分类
关键词
record linkage; blocking; suffix arrays;

机译：记录链接;阻塞后缀数组;

相似文献

外文文献
中文文献
专利

1. Improved suffix blocking for record linkage and entity resolution [J] . Allam Amin, Skiadopoulos Spiros, Kalnis Panos Data & Knowledge Engineering . 2018,第SEPa期

机译：改进的后缀阻塞，以实现记录链接和实体解析
2. Composite Bloom Filters for Secure Record Linkage [J] . Durham E.A., Kantarcioglu M., Xue Y., Knowledge and Data Engineering, IEEE Transactions on . 2014,第12期

机译：用于安全记录链接的复合Bloom过滤器
3. Privacy-preserving record linkage using Bloom filters [J] . Rainer Schnell, Tobias Bachteler, J?rg Reiher BMC Medical Informatics and Decision Making . 2009,第1期

机译：使用Bloom过滤器的隐私保护记录链接
4. Robust Record Linkage Blocking using Suffix Arrays [C] . Timothy de Vries, Hui Ke, Peter Christen, ACM conference on information and knowledge management;CIKM 09 . 2009

机译：使用后缀数组的稳健记录链接阻止
5. A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage [D] . Ranbaduge, Thilina. 2018

机译：多数据库隐私保护记录链接的可扩展阻止框架
6. Privacy-preserving record linkage using Bloom filters [O] . Rainer Schnell, Tobias Bachteler, Jörg Reiher 2009

机译：使用Bloom过滤器的隐私保护记录链接
7. Robust record linkage blocking using suffix arrays and Bloom filters [O] . Timothy De Vries, Hui Ke, Peter Christen, 2014

机译：使用后缀数组和Bloom过滤器进行健壮的记录链接阻止

Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters

摘要

著录项

引文网络

相似文献

相关主题

期刊订阅