Conference: International Conference on Management of Data

Processing Theta-Joins using MapReduce

Abstract

Joins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, implementation of join algorithms in MapReduce in general is not sufficiently understood. We study the problem of how to map arbitrary join conditions to Map and Reduce functions, i.e., a parallel infrastructure that controls data flow based on key-equality only. Our proposed join model simplifies creation of and reasoning about joins in MapReduce. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job. This algorithm only requires minimal statistics (input cardinality) and we provide evidence that for a variety of join problems, it is either close to optimal or the best possible option. For some of the problems where 1-Bucket-Theta is not the best choice, we show how to achieve better performance by exploiting additional input statistics. All algorithms can be made 'memory-aware', and they do not require any modifications to the MapReduce environment. Experiments show the effectiveness of our approach.
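
The abstract does not spell out how 1-Bucket-Theta works, so the following is only a minimal in-process sketch of one way a randomized, single-job theta-join of this kind could be organized. It assumes the cross product of the two inputs is viewed as a grid of rows x cols regions, one per reducer. Each tuple of S is sent to every region in one randomly chosen row, and each tuple of T to every region in one randomly chosen column, so every (s, t) pair meets in exactly one region where an arbitrary predicate can be evaluated. The function names `choose_grid` and `one_bucket_theta`, the grid-shape heuristic, and the simulation of the map and reduce phases with plain Python containers are all assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict


def choose_grid(n_s, n_t, k):
    """Pick a rows x cols grid with rows * cols == k.

    Assumed heuristic: minimize the expected per-reducer input
    n_s / rows + n_t / cols, using only the input cardinalities.
    """
    best = None
    for rows in range(1, k + 1):
        if k % rows:
            continue
        cols = k // rows
        cost = n_s / rows + n_t / cols
        if best is None or cost < best[0]:
            best = (cost, rows, cols)
    return best[1], best[2]


def one_bucket_theta(S, T, theta, k, seed=0):
    """Simulate a randomized theta-join over k reducers in one pass."""
    rng = random.Random(seed)
    rows, cols = choose_grid(len(S), len(T), k)
    # Each region (reducer) collects a list of S tuples and a list of T tuples.
    regions = defaultdict(lambda: ([], []))

    # "Map" phase: an S tuple picks a random row and is replicated
    # to every region in that row.
    for s in S:
        r = rng.randrange(rows)
        for c in range(cols):
            regions[(r, c)][0].append(s)
    # A T tuple picks a random column and is replicated to every
    # region in that column.
    for t in T:
        c = rng.randrange(cols)
        for r in range(rows):
            regions[(r, c)][1].append(t)

    # "Reduce" phase: each region joins its local tuples with the
    # arbitrary predicate theta. A pair (s, t) is co-located only in
    # region (row of s, column of t), so no result is duplicated.
    out = []
    for s_part, t_part in regions.values():
        out.extend((s, t) for s in s_part for t in t_part if theta(s, t))
    return out


if __name__ == "__main__":
    S = list(range(20))
    T = list(range(15))
    pairs = one_bucket_theta(S, T, lambda s, t: s < t, k=4)
    print(len(pairs))
```

In this sketch the only statistics needed are `len(S)` and `len(T)`, which matches the abstract's remark about input cardinality. The random row and column choices spread the input roughly evenly across regions regardless of the join predicate, at the cost of replicating every tuple to a full row or column.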
