Conference: ACM SIGMOD International Conference on Management of Data

Efficient Parallel Set-Similarity Joins Using MapReduce

Abstract

In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
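
The abstract does not spell out the three stages, so the following is only a minimal single-process sketch of how a set-similarity self-join can be organized in a map/reduce style. It assumes Jaccard similarity, a prefix filter over a global token order (rarest tokens first), and tokens used as partition keys. The names `jaccard` and `self_join` and the in-memory `groups` dictionary standing in for the shuffle phase are illustrative assumptions, not the paper's code or Hadoop APIs.

```python
from collections import defaultdict
from itertools import combinations
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def self_join(records, threshold):
    """Return sorted (rid1, rid2, similarity) triples with similarity >= threshold.

    records: dict mapping a record id to a set of tokens (hypothetical input shape).
    """
    # Assumed first stage: count token frequencies to fix a global token order.
    freq = defaultdict(int)
    for tokens in records.values():
        for t in tokens:
            freq[t] += 1

    def rank(t):
        # Rarest tokens first; ties broken by the token itself.
        return (freq[t], t)

    # Assumed second stage, "map" side: route each record id to one group per
    # prefix token. Two records with Jaccard >= threshold must share a token
    # among their first len - ceil(threshold * len) + 1 ordered tokens.
    groups = defaultdict(list)
    for rid, tokens in records.items():
        ordered = sorted(tokens, key=rank)
        prefix_len = len(ordered) - ceil(threshold * len(ordered)) + 1
        for t in ordered[:prefix_len]:
            groups[t].append(rid)

    # "Reduce" side: verify candidate pairs inside each group, deduplicating
    # pairs that meet in more than one group.
    results = {}
    for rids in groups.values():
        for r1, r2 in combinations(sorted(rids), 2):
            if (r1, r2) not in results:
                sim = jaccard(records[r1], records[r2])
                if sim >= threshold:
                    results[(r1, r2)] = sim
    return sorted((r1, r2, sim) for (r1, r2), sim in results.items())


if __name__ == "__main__":
    docs = {
        1: {"a", "b", "c", "d"},
        2: {"a", "b", "c", "e"},
        3: {"x", "y", "z"},
    }
    print(self_join(docs, 0.5))
```

In this sketch the output is a list of record-id pairs with their similarity; an end-to-end join would still need a final step that attaches the full records to each pair. In a real MapReduce job, each token group would be processed by a separate reducer, which is where the abstract's concerns about workload balance, replication, and per-node memory arise.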
