Scalable Blocking for Very Large Databases

Abstract

In the field of database deduplication, the goal is to find approximately matching records within a database. Blocking is a typical stage in this process that involves cheaply finding candidate pairs of records that are potential matches for further processing. We present here Hashed Dynamic Blocking, a new approach to blocking designed to address datasets larger than those studied in most prior work. Hashed Dynamic Blocking (HDB) extends Dynamic Blocking, which leverages the insight that rare matching values and rare intersections of values are predictive of a matching relationship. We also present a novel use of Locality Sensitive Hashing (LSH) to build blocking key values for huge databases with a convenient configuration to control the trade-off between precision and recall. HDB achieves massive scale by minimizing data movement, using compact block representation, and greedily pruning ineffective candidate blocks using a Count-min Sketch approximate counting data structure. We benchmark the algorithm by focusing on real-world datasets in excess of one million rows, demonstrating that the algorithm displays linear time complexity scaling in this range. Furthermore, we execute HDB on a 530 million row industrial dataset, detecting 68 billion candidate pairs in less than three hours at a cost of $307 on a major cloud service.
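The abstract names a Count-min Sketch as the approximate counter used when pruning ineffective candidate blocks. The following is a minimal, generic sketch of that data structure in Python, not the authors' implementation. The class `CountMinSketch`, the helper `prune_oversized_blocks`, the `max_block_size` threshold, and the rule that a block is ineffective when its estimated size exceeds that threshold are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable, List, Tuple


class CountMinSketch:
    """Approximate counter: estimates never undercount but may overcount."""

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key: str) -> List[int]:
        # One hash per row, obtained by salting BLAKE2b with the row index.
        cols = []
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            cols.append(int.from_bytes(digest, "little") % self.width)
        return cols

    def add(self, key: str, count: int = 1) -> None:
        for row, col in enumerate(self._columns(key)):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # The minimum across rows is the least-inflated counter for this key.
        return min(self.table[row][col] for row, col in enumerate(self._columns(key)))


def prune_oversized_blocks(
    pairs: Iterable[Tuple[int, str]], max_block_size: int
) -> List[Tuple[int, str]]:
    """Drop (record_id, blocking_key) pairs whose key has too many records.

    Assumption for illustration: a block is "ineffective" when its estimated
    size exceeds max_block_size (a hypothetical threshold).
    """
    pairs = list(pairs)
    sketch = CountMinSketch()
    for _, key in pairs:
        sketch.add(key)
    return [(rid, key) for rid, key in pairs if sketch.estimate(key) <= max_block_size]


# Hypothetical (record_id, blocking_key) pairs: a common value and a rare one.
records = [
    (1, "zip=98101"), (2, "zip=98101"), (3, "zip=98101"), (4, "zip=98101"),
    (1, "name=borthwick"), (5, "name=borthwick"),
]
kept = prune_oversized_blocks(records, max_block_size=3)
```

Because the sketch can only overestimate, a threshold test of this kind never keeps a block that is truly too large; the cost is that a small block may occasionally be dropped when hash collisions inflate its estimate. The table uses fixed memory regardless of how many distinct blocking keys exist, which is what makes this kind of counter attractive at large scale.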

Bibliographic details

Source
European Conference on Machine Learning; European Conference on Principles and Practice of Knowledge Discovery in Databases | 2020 | xv, 607 p. | 17 pages
Authors
Andrew Borthwick; Stephen Ash; Bin Pang
Original format: PDF
Chinese Library Classification: TF3-532
Keywords
Duplicate detection; Blocking; Entity matching; Record linkage

Similar literature

1. Representation of Synoptic-Scale Rossby Wave Packets and Blocking in the S2S Prediction Project Database [J]. Quinting J. F., Vitart F. Geophysical Research Letters. 2019, No. 2
2. Local anaesthetic dosage of peripheral nerve blocks in children: analysis of 40 121 blocks from the Pediatric Regional Anesthesia Network database [J]. Suresh S., De Oliveira G. S. Jr. British Journal of Anaesthesia. 2018, No. 2
3. Local anaesthetic dosage of peripheral nerve blocks in children: analysis of 40 121 blocks from the pediatric regional anesthesia network database (vol 120, pg 317, 2018) [J]. Suresh S., De Oliveira G. S. Jr. British Journal of Anaesthesia. 2018, No. 3
4. Scalable Blocking for Very Large Databases [C]. Andrew Borthwick, Stephen Ash, Bin Pang. European Conference on Machine Learning; European Conference on Principles and Practice of Knowledge Discovery in Databases. 2020
5. PERFORMANCE OF CONCURRENCY CONTROL METHODS IN DISTRIBUTED DATABASE MANAGEMENT SYSTEMS (TIMESTAMP ORDERING, TWO-PHASE LOCKING, OPTIMISTIC SCHEME, RESTART, TRANSACTION BLOCKING) [D]. MOON, SONG CHUN. 1985
6. MM-MDS: A Multidimensional Scaling Database with Similarity Ratings for 240 Object Categories from the Massive Memory Picture Database [O]. Michael C. Hout, Stephen D. Goldinger, Kyle J. Brady
7. Scalable Blocking for Very Large Databases [O]. Andrew Borthwick, Stephen Ash, Bin Pang. 2020
