Conference: 2017 14th Web Information Systems and Applications Conference

A Progressive Method for Detecting Duplication Entities Based on Bloom Filters

Abstract

As the volume of data grows rapidly, the cost of detecting duplicate entities in data cleaning has increased significantly. However, some real-time applications only need to identify as many duplicate entities as possible within a limited time, rather than all of them. Existing work uses sorting to divide similar records into blocks and arranges the processing order of the blocks to detect duplicate entities progressively. However, this approach works well only when the attributes of the records are suitable for sorting. This paper therefore proposes a novel progressive deduplication method for records that cannot be sorted by their attributes. The method distributes records into blocks based on their features and generates a modified Bloom filter index for each block. It then uses the Bloom filter to predict the probability of duplicate entities in each block, and this probability determines the processing order of the blocks so that duplicate entities are detected more quickly. Comprehensive experiments show that, within a limited time, the algorithm detects far more duplicates than the other algorithms considered in the related work.
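
The block-ordering idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not specify the "modified" Bloom filter, the blocking features, or the probability estimate, so everything below is an assumption made for illustration. In particular, `BloomFilter`, `normalize`, `block_key`, and `rank_blocks` are hypothetical names, the blocking feature (first character of the normalized record) is a placeholder, and the "probability" is approximated by the fraction of records in a block that the filter reports as possibly already seen.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """A plain Bloom filter (assumption: the paper's modified variant is not shown)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Insert item; return True if it was possibly present already."""
        seen = True
        for pos in self._positions(item):
            if not self.bits[pos]:
                seen = False
                self.bits[pos] = 1
        return seen


def normalize(record):
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(record.lower().split())


def block_key(record):
    # Hypothetical blocking feature: first character of the normalized record.
    return normalize(record)[:1]


def rank_blocks(records):
    """Group records into blocks and order blocks by estimated duplicate rate."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)

    scored = []
    for key, members in blocks.items():
        bloom = BloomFilter()
        # Count insertions that the filter reports as "possibly seen before".
        hits = sum(bloom.add(normalize(r)) for r in members)
        scored.append((key, hits / len(members), members))

    # Blocks with the highest estimated duplicate rate are processed first.
    scored.sort(key=lambda entry: entry[1], reverse=True)
    return scored
```

Calling `rank_blocks(records)` returns `(key, score, members)` tuples, so a time-limited deduplication loop could compare records pairwise block by block in that order and stop when its budget runs out. Because a Bloom filter can report false positives, the score is only a heuristic ordering signal, not a duplicate count.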
