Journal: Distributed and Parallel Databases

Set similarity join on massive probabilistic data using MapReduce

Abstract

This paper addresses set similarity join on massive probabilistic data using MapReduce, a problem for which no efficient approach currently exists. MapReduce is a popular paradigm for processing large volumes of data efficiently. We propose two MapReduce-based approaches: Hadoop Join by Map Side Pruning and Hadoop Join by Reduce Side Pruning. Hadoop Join by Map Side Pruning uses the sum of existence probabilities to filter out, directly at the Map task side, probabilistic sets that have no chance of being similar to any other probabilistic set. Hadoop Join by Reduce Side Pruning applies a probability-sum-based pruning principle and a probability-upper-bound-based pruning principle to reduce the number of candidate pairs at the Reduce task side, which saves comparison cost. Building on these approaches, we propose a hybrid solution that employs both Map-side and Reduce-side pruning. Finally, we implemented the approaches on Hadoop-0.20.2, ran comprehensive experiments to evaluate their performance, and measured the speedup ratio relative to the naive method, Block Nested Loop Join. The experimental results show that our approaches perform much better than Block Nested Loop Join and scale well. To the best of our knowledge, this is the first work to address set similarity join on massive probabilistic data using the MapReduce paradigm, and the proposed approaches provide a new way to process massive probabilistic data.
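
The abstract does not give the similarity measure or the exact pruning rules, so the following is only a minimal single-machine sketch of the idea behind probability-sum pruning, under stated assumptions. It assumes a probabilistic set is a mapping from element to an independent existence probability, and that two sets join when their expected overlap (the sum of the products of the probabilities of shared elements) reaches a threshold. Under that assumed measure, the expected overlap can never exceed either set's probability sum, so a set whose probability sum is below the threshold can be dropped before any pairwise comparison. The names `prob_sum`, `expected_overlap`, `map_side_prune`, and `similarity_join` are hypothetical and are not taken from the paper or from Hadoop.

```python
from itertools import combinations


def prob_sum(pset):
    """Sum of existence probabilities of one probabilistic set."""
    return sum(pset.values())


def expected_overlap(a, b):
    """Assumed similarity: expected number of shared elements,
    treating element existence as independent across the two sets."""
    return sum(p * b[e] for e, p in a.items() if e in b)


def map_side_prune(psets, threshold):
    """Drop sets that cannot reach the threshold with any partner.

    Since p_a(e) * p_b(e) <= p_a(e), the expected overlap of a pair is
    at most the probability sum of either set, so this filter is safe
    for the assumed measure (it is not necessarily the paper's rule).
    """
    return {sid: ps for sid, ps in psets.items() if prob_sum(ps) >= threshold}


def similarity_join(psets, threshold):
    """Naive pairwise join over the sets that survive pruning."""
    survivors = map_side_prune(psets, threshold)
    result = []
    for (id_a, a), (id_b, b) in combinations(sorted(survivors.items()), 2):
        if expected_overlap(a, b) >= threshold:
            result.append((id_a, id_b))
    return result
```

In a MapReduce setting, a filter like `map_side_prune` would run inside each Map task so that pruned sets are never shuffled to the Reduce side, while the pairwise loop stands in for the work done at the Reduce side.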
