首页> 外文期刊>IEEE Transactions on Knowledge and Data Engineering >A Generic Method for Accelerating LSH-Based Similarity Join Processing
【24h】

A Generic Method for Accelerating LSH-Based Similarity Join Processing

机译:一种基于LSH的相似联接处理的通用方法

获取原文
获取原文并翻译 | 示例
           

摘要

Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyzes show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyzes and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to operate LSH lookups for at most 15 percent of the query points.
机译:局部敏感哈希(LSH)是解决高维空间中近似相似性搜索问题的有效方法。通过LSH,可以以与哈希联接相同的方式处理高维相似联接,从而使两个大型数据集的联接成本线性化。通过司法分析多种LSH算法的属性,我们提出了一种通用方法来加快使用LSH合并两个大型数据集的过程。我们方法的关键在于确定一组代表性点以减少LSH查找次数的方式。理论分析表明,与对每个查询点执行LSH查找相比,我们提出的方法可以大大减少查找操作的数量,并保持相同的结果精度。此外,我们通过证明相同的原理可以应用于LSH算法的三个不同指标来证明我们方法的通用性:三个指标:欧氏距离(QALSH),雅卡德相似性度量(MinHash)和汉明距离(序列哈希)。使用真实数据集进行的实验研究结果证实了我们的误差分析,并表明我们的方法相对于最新的LSH方法有显着改进:要实现0.95以上的召回率,我们最多需要对LSH查找进行15%的查找查询点。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号