Journal: Science of Computer Programming

Parallel set similarity join on big data based on Locality-Sensitive Hashing


Abstract

Because of the huge amount of data involved and the time-consuming nature of join operations, exact-match joins are rarely used for big data. The most common alternative to the exact-match join is the similarity join, which finds similar pairs of records. Set similarity join (SSJ) is defined as the join of very large tables based on the similarity of a set of their attributes, called the join attributes. To perform a similarity join of two large tables, the similarity of the join-attribute values is computed with an appropriate similarity function, and the value pairs whose similarity is higher than a given threshold are selected as join candidates. In this paper, a parallel set similarity join method is introduced using the MapReduce programming model. The proposed method uses Locality-Sensitive Hashing (LSH) techniques to decrease the number of comparisons required to calculate the similarity of the sets. The performance of the proposed method has been compared, in terms of running time, with the best previous similarity join methods on real and synthetic datasets. The experimental results show that the proposed method is faster than the earlier methods.
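The abstract does not spell out which LSH family or similarity function the paper uses, so the following is only a minimal single-process sketch of the general idea, assuming Jaccard similarity, MinHash signatures, and the standard banding scheme. It performs a self-join on one collection rather than a join of two tables, and all function names and parameters (`similarity_join`, `num_hashes`, `bands`, and so on) are hypothetical, not taken from the paper. The bucketing step plays the role a map phase might play in MapReduce (emit a band key per record), and the pair generation plus verification step plays the role of a reduce phase.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # large prime modulus for the (a*x + b) mod p hash family


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for num_hashes hash functions (seeded, so repeatable)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(tokens, params):
    """MinHash signature of a non-empty set of string tokens."""
    # crc32 gives a process-independent integer id per token (unlike built-in hash()).
    ids = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    return tuple(min((a * x + b) % PRIME for x in ids) for a, b in params)


def lsh_candidates(records, params, bands):
    """Return record-id pairs that share at least one LSH band bucket."""
    rows = len(params) // bands
    buckets = defaultdict(set)
    # "Map"-like step: each record is emitted once per band under its band key.
    for rid, tokens in records.items():
        if not tokens:
            continue  # assumption: empty sets are simply skipped
        sig = minhash_signature(tokens, params)
        for band in range(bands):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets[key].add(rid)
    # "Reduce"-like step: records grouped under the same key become candidates.
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def similarity_join(records, threshold, num_hashes=64, bands=32, seed=0):
    """Self-join: pairs whose exact Jaccard similarity is at least threshold."""
    params = make_hash_params(num_hashes, seed)
    result = []
    for r, s in sorted(lsh_candidates(records, params, bands)):
        sim = jaccard(records[r], records[s])  # verify each candidate exactly
        if sim >= threshold:
            result.append((r, s, sim))
    return result


if __name__ == "__main__":
    docs = {
        "a": {"x", "y", "z"},
        "b": {"x", "y", "z"},
        "c": {"p", "q"},
    }
    print(similarity_join(docs, threshold=0.8))
```

Only pairs that collide in some band are compared exactly, which is where the reduction in comparisons comes from. Because every candidate is verified with the exact similarity, the output contains no false positives, but LSH is probabilistic, so a truly similar pair can occasionally be missed; the `num_hashes` and `bands` settings trade recall against the number of candidates.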
