Conference: 2011 International Conference on Data and Knowledge Engineering

A generalization of blocking and windowing algorithms for duplicate detection

Abstract

Duplicate detection is the process of finding multiple records in a dataset that represent the same real-world entity. Due to the enormous costs of an exhaustive comparison, typical algorithms select only promising record pairs for comparison. Two competing approaches are blocking and windowing. Blocking methods partition records into disjoint subsets, while windowing methods, in particular the Sorted Neighborhood Method, slide a window over the sorted records and compare records only within the window. We present a new algorithm called Sorted Blocks in several variants, which generalizes both approaches. To evaluate Sorted Blocks, we have conducted extensive experiments with different datasets. These show that our new algorithm needs fewer comparisons to find the same number of duplicates.
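The two baseline strategies named in the abstract can be illustrated with a minimal Python sketch. It only generates candidate pairs and does not decide whether a pair is a true duplicate. The function names, the `key` parameter, and the sample records are assumptions made for illustration and do not come from the paper. The Sorted Blocks algorithm itself is not sketched, because the abstract gives no details of how it works.

```python
from collections import defaultdict
from itertools import combinations


def blocking_pairs(records, key):
    """Blocking: partition records into disjoint blocks by key and
    pair up only records that fall into the same block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        # members are in ascending index order, so each pair is (low, high)
        pairs.update(combinations(members, 2))
    return pairs


def sorted_neighborhood_pairs(records, key, window):
    """Windowing (Sorted Neighborhood): sort records by key, slide a
    window of `window` records over the sorted order, and pair up only
    records that share a window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # each record is paired with the next (window - 1) records
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the paper's experiments.
    names = ["anna", "anne", "bob", "bobby", "ann"]
    print(sorted(blocking_pairs(names, key=lambda r: r[0])))
    print(sorted(sorted_neighborhood_pairs(names, key=lambda r: r, window=2)))
```

On this toy input an exhaustive comparison would check 10 pairs, while each strategy proposes 4. Blocking on the first letter never pairs "anne" with "bob", whereas a window of size 2 does, and the window in turn misses the pair "anne" / "ann" that blocking finds. The two strategies therefore trade off differently between comparisons saved and duplicates missed.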