
A Precise Blocking Method for Record Linkage


Abstract

Identifying approximately duplicate records between databases requires the costly computation of distances between their attributes. Duplicate detection is therefore usually performed in two phases: an efficient blocking phase that selects a small number of candidate duplicates based on simple criteria, followed by a second phase that performs an in-depth comparison of those candidates. This paper introduces and evaluates a precise and efficient approach for the blocking phase. It requires only standard indices, yet performs as well as other approaches based on special-purpose indices and outperforms other approaches based on standard indices. The key idea is to use a comparison window whose size depends dynamically on a maximum distance, rather than a window of fixed size.
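The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a single numeric blocking key (for example a birth year) and contrasts a fixed-size window over the sorted keys with a window that grows or shrinks according to a maximum key distance. The function names `fixed_window_pairs` and `dynamic_window_pairs` and the parameters `w` and `max_dist` are hypothetical.

```python
from bisect import bisect_right


def fixed_window_pairs(keys, w):
    """Fixed-size window: pair each record with the next w-1 records in sort order."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + w]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def dynamic_window_pairs(keys, max_dist):
    """Distance-based window: pair each record with all later records in sort
    order whose key is at most max_dist away (assumes numeric keys)."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    sorted_keys = [keys[i] for i in order]
    pairs = set()
    for pos, i in enumerate(order):
        # The window ends at the first key greater than this key + max_dist.
        end = bisect_right(sorted_keys, sorted_keys[pos] + max_dist)
        for j in order[pos + 1 : end]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


years = [1980, 1981, 1981, 1982, 1995, 2010]
print(sorted(fixed_window_pairs(years, 2)))
print(sorted(dynamic_window_pairs(years, 1)))
```

On this toy input, the fixed window of size 2 pairs records 3 and 4 (1982 and 1995) even though they are far apart, and misses the pair 0 and 2 (1980 and 1981), which are close. The distance-based window yields exactly the pairs whose keys differ by at most 1. A sorted list with binary search stands in here for the "standard index" the abstract mentions; that mapping is an assumption.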
