Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing

机译：使用并行局部敏感散列媒体相似性搜索超过10亿推文

获取原文

页面导航

摘要
著录项
引文网络
相似文献
相关主题

摘要

Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high dimensional data (where spatial indexes like kd-trees do not perform well) is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects. In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH) designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and which supports high-throughput streaming of new data. Our approach employs several novel ideas, including: cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash-table querying; an insert-optimized hash table structure and efficient data expiration algorithm for streaming data; and a performance model that accurately estimates performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of ＞ 1 Billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1-2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7 × faster and query times that are 8.3 × faster than a basic implementation.

机译：查找最近的邻居已成为数据库的一个重要操作，其中应用于文本搜索，多媒体索引以及许多其他区域。一种适用于相似性搜索的流行算法，特别是对于高维数据（如KD树的空间索引不执行井）是用于查找类似对象的近似算法的临时敏感散列（LSH）。在本文中，我们描述了一个新的LSH变体，被设计为非常有效的并行LSH（PLSH），能够在多个节点和多个核上进行扩展，并且支持新数据的高吞吐量流。我们的方法采用了几种新颖的想法，包括：使用2级合并算法进行缓存有意识的散列表布局，用于散列表构造;哈希表查询期间重复消除的高效算法;用于流数据的插入优化的哈希表结构和高效数据到期算法;和准确估计算法性能的性能模型，可用于优化参数设置。我们展示了在我们在DataSet上执行相似性搜索的工作量，每天有数亿个新推文，我们可以实现1-2.5毫秒的查询时间。我们表明这是比现有索引方案快的数量级，例如反相索引。据我们所知，这是LSH的最快实现，表施工时间高达3.7倍，查询时间快8.3倍，而不是基本实现。

著录项

来源
《International conference on very large data bases》|2013年||共12页
会议地点
作者
Narayanan Sundaram; Aizana Turmukhametova; Nadathur Satish; Todd Mostak; Piotr Indyk; Samuel Madden; Pradeep Dubey;
展开▼
作者单位

展开▼
会议组织
原文格式 PDF
正文语种
中图分类 TP311.13;
关键词

相似文献

外文文献
中文文献
专利

1. Parallel set similarity join on big data based on Locality-Sensitive Hashing [J] . Mohammad Karim Sohrabi, Hosseion Azgomi Science of Computer Programming . 2017,第octa1期

机译：基于局部敏感哈希的大数据并行集合相似性联接
2. Fast distributed video deduplication via locality-sensitive hashing with similarity ranking [J] . Yeguang Li, Liang Hu, Ke Xia, EURASIP journal on image and video processing . 2019,第1期

机译：通过具有相似性排名的位置敏感散列快速分布式视频重复数据删除
3. A novel locality-sensitive hashing algorithm for similarity searches on large-scale hyperspectral data [J] . Zhou Yuan, Liu Chun, Li Nan, Remote sensing letters . 2016,第10a12期

机译：一种用于大规模高光谱数据相似性搜索的局部敏感哈希算法
4. Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing [C] . Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, International conference on very large data bases . 2013

机译：使用并行局部敏感哈希在10亿条推文上流式传输相似性搜索
5. Learning Effective Binary Representation with Deep Hashing Technique for Large-Scale Multimedia Similarity Search [D] . Wu, Gengshen. 2020

机译：学习具有深度散列技术的有效二进制表示，用于大规模多媒体相似性搜索
6. Fast Image Search with Locality-Sensitive Hashing and Homogeneous Kernels Map [O] . Jun-yi Li, Jian-hua Li 2015

机译：通过局部敏感的散列和均匀核图进行快速图像搜索
7. Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing [O] . Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, 2014

机译：使用并行局部敏感哈希在10亿条推文上流式传输相似性搜索

Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing

摘要

著录项

引文网络

相似文献

相关主题

期刊订阅