Conference: Database and Expert Systems Applications

PC-Filter: A Robust Filtering Technique for Duplicate Record Detection in Large Databases

Abstract

In this paper, we propose PC-Filter (PC stands for Partition Comparison), a robust data filter for approximate duplicate record detection in large databases. PC-Filter distinguishes itself from existing methods by using the notion of partitions in duplicate detection. It first sorts the whole database and splits the sorted database into a number of record partitions. The Partition Comparison Graph (PCG) is then constructed by performing fast partition pruning. Finally, duplicate records are detected by internal and external partition comparison based on the PCG. Four properties, derived from the triangle inequality of record similarity and used as heuristics, have been devised to make the filter efficient. PC-Filter is insensitive to the key used to sort the database and achieves a recall level comparable to that of pair-wise record comparison, but with a complexity of only O(N^(4/3)). By equipping existing detection methods with PC-Filter, we are able to solve the "Key Selection" problem, the "Scope Specification" problem and the "Low Recall" problem that existing methods suffer from.
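
The pipeline described in the abstract (sort, partition, prune partition pairs into a PCG, then compare inside and across partitions) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not spell out the four properties or how the PCG is built, so the anchor-and-radius partition summary, the use of Levenshtein distance as the record distance, and the names `edit_distance`, `pc_filter_sketch`, `partition_size` and `threshold` are all assumptions made for illustration. The sketch also compares every pair of partition anchors, so it makes no attempt to reach the stated O(N^(4/3)) complexity.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance. It is a true metric, so the triangle inequality holds."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def pc_filter_sketch(records, partition_size=3, threshold=2):
    """Return index pairs (i, j), i < j, whose records are within `threshold` edits."""
    # Step 1: sort the records (here simply on the whole string) and partition them.
    order = sorted(range(len(records)), key=lambda i: records[i])
    parts = [order[k:k + partition_size]
             for k in range(0, len(order), partition_size)]

    # Assumed partition summary (not from the paper): the first record acts as
    # an anchor, and the radius is the largest anchor-to-member distance.
    anchors = [records[p[0]] for p in parts]
    radii = [max(edit_distance(records[i], a) for i in p)
             for p, a in zip(parts, anchors)]

    # Step 2: build the Partition Comparison Graph by pruning. If x is in
    # partition u, y is in partition v and d(x, y) <= threshold, the triangle
    # inequality gives d(anchor_u, anchor_v) <= radius_u + threshold + radius_v.
    # Partition pairs that violate this bound cannot contain a duplicate pair.
    pcg = [(u, v) for u, v in combinations(range(len(parts)), 2)
           if edit_distance(anchors[u], anchors[v]) <= radii[u] + radii[v] + threshold]

    def is_duplicate(i, j):
        return edit_distance(records[i], records[j]) <= threshold

    found = set()
    # Step 3a: internal comparison, all pairs inside each partition.
    for p in parts:
        for i, j in combinations(p, 2):
            if is_duplicate(i, j):
                found.add((min(i, j), max(i, j)))
    # Step 3b: external comparison, only across partitions linked in the PCG.
    for u, v in pcg:
        for i in parts[u]:
            for j in parts[v]:
                if is_duplicate(i, j):
                    found.add((min(i, j), max(i, j)))
    return found
```

Because the pruning rule in this sketch only discards partition pairs that provably cannot contain a match, its output equals that of exhaustive pair-wise comparison whatever sort key is used; the sort order affects only how many partition pairs survive pruning. That mirrors, in a simplified setting, the abstract's claim that the filter is insensitive to the sort key.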
