Set similarity join on massive probabilistic data using MapReduce

Youzhong Ma; Xiaofeng Meng

Abstract

This paper addresses set similarity join on massive probabilistic data using MapReduce, a problem for which no efficient approach currently exists. MapReduce is a popular paradigm for processing large volumes of data, and we propose two MapReduce-based approaches to this task: Hadoop Join by Map Side Pruning and Hadoop Join by Reduce Side Pruning. Hadoop Join by Map Side Pruning uses the sum of the existence probabilities to filter out, directly at the Map task side, the probabilistic sets that have no chance of being similar to any other probabilistic set. Hadoop Join by Reduce Side Pruning applies a probability-sum-based pruning principle and a probability-upper-bound-based pruning principle to reduce the number of candidate pairs at the Reduce task side, which saves comparison cost. Building on these approaches, we propose a hybrid solution that employs both Map-side and Reduce-side pruning. Finally, we implemented the approaches on Hadoop-0.20.2, ran comprehensive experiments to evaluate their performance, and measured the speedup ratio against the naive method, Block Nested Loop Join. The experimental results show that our approaches perform much better than Block Nested Loop Join and scale well. To the best of our knowledge, this is the first work to address set similarity join on massive probabilistic data using the MapReduce paradigm, and the proposed approaches provide a new way to process massive probabilistic data.
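
As an illustration only: the abstract does not define the data model or the exact pruning rule, so the sketch below assumes a set-level model in which each probabilistic set is a list of alternative instances whose existence probabilities sum to at most 1, instances of different sets are independent, and a pair joins when Pr[Jaccard >= gamma] >= alpha. Under those assumptions the join probability of any pair is bounded by each set's probability sum, so a set whose sum falls below alpha can be discarded before any pairing, which is the kind of filter the abstract attributes to the Map side. All names (`map_side_prune`, `join_probability`, `gamma`, `alpha`) and the sample data are hypothetical, and the code runs in plain Python rather than on Hadoop.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two ordinary (certain) sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def total_probability(pset):
    """Sum of the existence probabilities of all instances of one probabilistic set."""
    return sum(p for _, p in pset)


def map_side_prune(psets, alpha):
    """Assumed pruning rule: keep only sets whose probability sum reaches alpha."""
    return {key: pset for key, pset in psets.items()
            if total_probability(pset) >= alpha}


def join_probability(r, s, gamma):
    """Pr[jaccard(R, S) >= gamma], assuming the instances of R and S are independent."""
    return sum(pr * ps
               for (a, pr), (b, ps) in product(r, s)
               if jaccard(a, b) >= gamma)


def similarity_join(psets, gamma, alpha):
    """Prune first, then compare the surviving pairs exhaustively."""
    survivors = map_side_prune(psets, alpha)
    keys = sorted(survivors)
    return [(x, y)
            for i, x in enumerate(keys)
            for y in keys[i + 1:]
            if join_probability(survivors[x], survivors[y], gamma) >= alpha]


# Hypothetical sample data: each set is a list of (instance, existence probability).
sample_sets = {
    "A": [(frozenset({1, 2, 3}), 0.6), (frozenset({1, 2}), 0.3)],
    "B": [(frozenset({1, 2, 3}), 0.5), (frozenset({2, 3, 4}), 0.4)],
    "C": [(frozenset({7, 8}), 0.2)],
}
joined_pairs = similarity_join(sample_sets, gamma=0.6, alpha=0.25)
```

In this sample, set C has a probability sum of 0.2, which is below alpha = 0.25, so it is dropped without being compared against A or B. The Reduce-side principles named in the abstract are not sketched here because the abstract gives no detail on how their bounds are computed.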

Bibliographic details

Source
Distributed and Parallel Databases, 2014, Issue 3, pp. 447-464 (18 pages)
Authors
Youzhong Ma; Xiaofeng Meng
Affiliation
School of Information, Renmin University of China, Beijing, China
Original format
PDF
Language of text
English
Keywords
Set similarity join; MapReduce; Probabilistic data

