首页> 外文会议>International conference on very large data bases >Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing
【24h】

Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing

机译:使用并行局部敏感散列媒体相似性搜索超过10亿推文

获取原文

摘要

Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high dimensional data (where spatial indexes like kd-trees do not perform well) is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects. In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH) designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and which supports high-throughput streaming of new data. Our approach employs several novel ideas, including: cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash-table querying; an insert-optimized hash table structure and efficient data expiration algorithm for streaming data; and a performance model that accurately estimates performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of > 1 Billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1-2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7 × faster and query times that are 8.3 × faster than a basic implementation.
机译:查找最近的邻居已成为数据库的一个重要操作,其中应用于文本搜索,多媒体索引以及许多其他区域。一种适用于相似性搜索的流行算法,特别是对于高维数据(如KD树的空间索引不执行井)是用于查找类似对象的近似算法的临时敏感散列(LSH)。在本文中,我们描述了一个新的LSH变体,被设计为非常有效的并行LSH(PLSH),能够在多个节点和多个核上进行扩展,并且支持新数据的高吞吐量流。我们的方法采用了几种新颖的想法,包括:使用2级合并算法进行缓存有意识的散列表布局,用于散列表构造;哈希表查询期间重复消除的高效算法;用于流数据的插入优化的哈希表结构和高效数据到期算法;和准确估计算法性能的性能模型,可用于优化参数设置。我们展示了在我们在DataSet上执行相似性搜索的工作量,每天有数亿个新推文,我们可以实现1-2.5毫秒的查询时间。我们表明这是比现有索引方案快的数量级,例如反相索引。据我们所知,这是LSH的最快实现,表施工时间高达3.7倍,查询时间快8.3倍,而不是基本实现。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号