Streaming Quotient Filter: A Near Optimal Approximate Duplicate Detection Approach for Data Streams


Abstract

The unparalleled growth and popularity of the Internet, coupled with the advent of diverse modern applications such as search engines, online transactions, and climate warning systems, has led to an unprecedented expansion in the volume of data stored worldwide. Efficient storage, management, and processing of such massive amounts of data has emerged as a central theme of research. Detecting and removing redundancies and duplicates in real time from such multi-trillion-record sets, to bolster resource and compute efficiency, constitutes a challenging area of study. The infeasibility of storing the entire data from potentially unbounded data streams, together with the need for precise elimination of duplicates, calls for intelligent approximate duplicate detection algorithms. The literature hosts numerous works based on the well-known probabilistic bitmap structure, the Bloom Filter, and its variants. In this paper we propose a novel data structure, the Streaming Quotient Filter (SQF), for efficient detection and removal of duplicates in data streams. SQF intelligently stores the signatures of elements arriving on a data stream and, together with an eviction policy, provides near-zero false positive and false negative rates. We show that the near-optimal performance of SQF is achieved with a very low memory requirement, making it ideal for real-time, memory-efficient de-duplication applications with extremely low tolerance for false positives and false negatives. We present a detailed theoretical analysis of the working of SQF, providing a guarantee on its performance. Empirically, we compare SQF to alternative methods and show that the proposed method is superior to existing solutions in terms of memory and accuracy. We also discuss Dynamic SQF for evolving streams and the parallel implementation of SQF.
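The idea of storing element signatures together with an eviction policy can be illustrated with a minimal Python sketch. This is not the authors' SQF algorithm: the abstract does not give its layout or eviction rule, so the class name `SignatureFilter`, the parameters `quotient_bits`, `remainder_bits`, and `bucket_size`, the SHA-256 fingerprint, and the FIFO eviction are all assumptions chosen only for illustration. The sketch splits each fingerprint into a quotient (which selects a bucket) and a remainder (the stored signature).

```python
import hashlib
from collections import deque


class SignatureFilter:
    """Toy quotient-style duplicate detector (illustrative, not the paper's SQF)."""

    def __init__(self, quotient_bits=8, remainder_bits=16, bucket_size=4):
        self.quotient_bits = quotient_bits
        self.remainder_bits = remainder_bits
        # One bounded bucket per quotient value. deque(maxlen=...) silently
        # drops the oldest remainder when full (FIFO eviction, an assumed policy).
        self.buckets = [deque(maxlen=bucket_size) for _ in range(1 << quotient_bits)]

    def _signature(self, item):
        # Hash the item and keep the top (quotient_bits + remainder_bits) bits
        # of the first 8 digest bytes as its fingerprint.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        bits = self.quotient_bits + self.remainder_bits
        fingerprint = int.from_bytes(digest[:8], "big") >> (64 - bits)
        quotient = fingerprint >> self.remainder_bits
        remainder = fingerprint & ((1 << self.remainder_bits) - 1)
        return quotient, remainder

    def seen_before(self, item):
        """Report whether item looks like a duplicate, then record it."""
        quotient, remainder = self._signature(item)
        bucket = self.buckets[quotient]
        if remainder in bucket:
            return True
        bucket.append(remainder)
        return False


def deduplicate(stream, filt):
    """Keep only the elements the filter has not reported as duplicates."""
    return [x for x in stream if not filt.seen_before(x)]
```

In this sketch a false positive occurs when two distinct elements share the same quotient and remainder, and a false negative occurs when an element's remainder has been evicted from its bucket before the element reappears. Widening the remainder reduces the first kind of error and enlarging the buckets reduces the second, both at the cost of memory.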


Bibliographic record

Source: International Conference on Very Large Data Bases, 2013, 12 pages
Authors: Sourav Dutta; Ankur Narang; Suman K. Bera
Original format: PDF
Chinese Library Classification: TP311.13

