首页> 外文会议>ACM SIGMOD international conference on management of data >Similarity Search and Locality Sensitive Hashing using Ternary Content Addressable Memories
【24h】

Similarity Search and Locality Sensitive Hashing using Ternary Content Addressable Memories

机译:使用三元内容可寻址存储的相似性搜索和局部敏感散列

获取原文

摘要

Similarity search methods are widely used as kernels in various data mining and machine learning applications including those in computational biology, web search/clustering. Nearest neighbor search (NNS) algorithms are often used to retrieve similar entries, given a query. While there exist efficient techniques for exact query lookup using hashing, similarity search using exact nearest neighbors suffers from a "curse of dimensionality", i.e. for high dimensional spaces, best known solutions offer little improvement over brute force search and thus are unsuitable for large scale streaming applications. Fast solutions to the approximate NNS problem include Locality Sensitive Hashing (LSH) based techniques, which need storage polynomial in n with exponent greater than f, and query time sublinear, but still polynomial in n, where n is the size of the database. In this work we present a new technique of solving the approximate NNS problem in Euclidean space using a Ternary Content Addressable Memory (TCAM), which needs near linear space and has O(l) query time. In fact, this method also works around the best known lower bounds in the cell probe model for the query time using a data structure near linear in the size of the data base. TCAMs are high performance associative memories widely used in networking applications such as address lookups and access control lists. A TCAM can query for a bit vector within a database of ternary vectors, where every bit posi- tion represents 0, f or *. The * is a wild card representing either a 0 or a 1. We leverage TCAMs to design a variant of LSH, called Ternary Locality Sensitive Hashing (TLSH) wherein we hash database entries represented by vectors in the Euclidean space into {0, 1, *}. By using the added functionality of a TLSH scheme with respect to the * character, we solve an instance of the approximate nearest neighbor problem with 1 TCAM access and storage nearly linear in the size of the database. We validate our claims with extensive simulations using both real world (Wikipedia) as well as synthetic (but illustrative) datasets. We observe that using a TCAM of width 288 bits, it is possible to solve the approximate NNS problem on a database of size 1 million points with high accuracy. Finally, we design an experiment with TCAMs within an enterprise ethernet switch (Cisco Catalyst 4500) to validate that TLSH can be used to perform f.5 million queries per second per lGb/s port. We believe that this work can open new avenues in very high speed data mining.
机译:相似性搜索方法广泛用作各种数据挖掘和机器学习应用中的内核,包括计算生物学,Web搜索/群集中的应用程序。定为查询,最近的邻居搜索(NNS)算法通常用于检索类似的条目。虽然使用散列的精确查询查找有效技术,但使用精确最近邻居的相似性搜索遭受了“维度的诅咒”,即对于高维空间,最佳已知的解决方案对蛮力搜索的改进很少,因此不适合大规模流媒体应用程序。近似NNS问题的快速解决方案包括基于位置敏感的散列(LSH)技术,其在N中需要存储多项式的技术,并且在N中查询时间载位,但仍然是多项式,其中N是数据库的大小。在这项工作中,我们使用三元内容可寻址存储器(TCAM)来求解欧几里德空间中的近似NNS问题的新技术,该内容可寻址存储器(TCAM)需要近线性空间并具有O(L)查询时间。实际上,这种方法还在Cell探测模型中的最佳已知的下限,用于查询时间,使用数据结构在数据库的大小上接近线性。 TCAMS是高性能关联存储器,广泛用于网络应用,如地址查找和访问控制列表。 TCAM可以在三元向量数据库中查询位矢量,其中每个位位置代表0,f或*。 *是一种代表0或a的通配符。我们利用TCAM设计一个LSH的变体,称为三元局部敏感散列(TLSH),其中我们哈希亚空间中的向量表示的哈希数据库条目进入{0,1, *}。通过使用TLSH方案的附加功能相对于*字符,我们解决了近似最近邻问题的一个实例,其中1个TCAM访问和存储在数据库大小中几乎线性。我们使用现实世界(维基百科)以及合成(但说明性)数据集进行广泛的模拟验证我们的索赔。我们观察到,使用宽度288位的TCAM,可以高精度地解决大小1百万点的数据库上的近似NNS问题。最后,我们设计了企业以太网交换机(Cisco Catalyst 4500)内的TCAM的实验,以验证TCHSH可用于每个LGB / S端口执行每秒的50万查询。我们相信这项工作可以在非常高速的数据挖掘中开辟新的途径。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号