Conference: International Conference on Very Large Data Bases

Streaming Quotient Filter: A Near Optimal Approximate Duplicate Detection Approach for Data Streams


Abstract

The unparalleled growth and popularity of the Internet, coupled with the advent of diverse modern applications such as search engines, online transactions, and climate warning systems, has led to an unprecedented expansion in the volume of data stored worldwide. Efficient storage, management, and processing of such massive, exponentially growing amounts of data has emerged as a central theme of research in this direction. Detecting and removing redundancies and duplicates in real time from such multi-trillion-record sets, to bolster resource and compute efficiency, constitutes a challenging area of study. The infeasibility of storing the entire data from potentially unbounded data streams, combined with the need for precise elimination of duplicates, calls for intelligent approximate duplicate detection algorithms. The literature hosts numerous works based on the well-known probabilistic bitmap structure, the Bloom Filter, and its variants. In this paper we propose a novel data structure, the Streaming Quotient Filter (SQF), for efficient detection and removal of duplicates in data streams. SQF intelligently stores the signatures of elements arriving on a data stream and, along with an eviction policy, provides near-zero false positive and false negative rates. We show that the near-optimal performance of SQF is achieved with a very low memory requirement, making it ideal for real-time, memory-efficient de-duplication applications with an extremely low tolerance for false positives and false negatives. We present a detailed theoretical analysis of the working of SQF, providing a guarantee on its performance. Empirically, we compare SQF to alternative methods and show that the proposed method is superior to existing solutions in terms of memory and accuracy. We also discuss Dynamic SQF for evolving streams and the parallel implementation of SQF.
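
The abstract does not spell out how SQF lays out its signatures or which eviction policy it uses, so the following is only a minimal Python sketch of the general idea of storing hashed signatures with eviction, not the paper's actual structure. Every name and parameter here (`QuotientStyleDedupFilter`, `quotient_bits`, `remainder_bits`, `bucket_size`) is hypothetical, and the choices of SHA-256 fingerprints and first-in-first-out eviction are assumptions made for illustration.

```python
import hashlib
from collections import deque


class QuotientStyleDedupFilter:
    """Illustrative quotient-style duplicate filter (NOT the paper's exact SQF)."""

    def __init__(self, quotient_bits=10, remainder_bits=16, bucket_size=4):
        self.q = quotient_bits
        self.r = remainder_bits
        # One bounded bucket per quotient value. deque(maxlen=...) silently
        # drops the oldest remainder when a full bucket is appended to,
        # which serves as the (assumed) FIFO eviction policy.
        self.buckets = [deque(maxlen=bucket_size) for _ in range(1 << self.q)]

    def _signature(self, item):
        # Hash the element and keep q + r bits as its fingerprint.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        mask = (1 << (self.q + self.r)) - 1
        fingerprint = int.from_bytes(digest[:8], "big") & mask
        quotient = fingerprint >> self.r                # selects the bucket
        remainder = fingerprint & ((1 << self.r) - 1)   # stored in the bucket
        return quotient, remainder

    def seen_before(self, item):
        """Report whether item looks like a duplicate, recording it if not."""
        quotient, remainder = self._signature(item)
        bucket = self.buckets[quotient]
        if remainder in bucket:
            return True   # may be a false positive if two fingerprints collide
        bucket.append(remainder)  # may evict an older remainder (false negatives)
        return False


dedup = QuotientStyleDedupFilter()
flags = [dedup.seen_before(x) for x in ["a", "b", "a", "c", "b"]]
```

In this sketch, a fingerprint collision between two distinct elements yields a false positive, and an evicted remainder yields a false negative if that element reappears later. Those are the two error types the abstract says SQF keeps near zero.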
