Journal: Knowledge and Data Engineering, IEEE Transactions on
MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters

Abstract

Data analysis is an important functionality in cloud computing, allowing huge amounts of data to be processed over very large clusters. MapReduce is recognized as a popular way to handle data in the cloud environment due to its excellent scalability and good fault tolerance. However, compared to parallel databases, MapReduce is slower when it is used for complex data analysis tasks that require joining multiple data sets in order to compute certain aggregates. A common concern is whether MapReduce can be improved to produce a system with both scalability and efficiency. In this paper, we introduce Map-Join-Reduce, a system that extends and improves the MapReduce runtime framework to efficiently process complex data analysis tasks on large clusters. We first propose a filtering-join-aggregation programming model, a natural extension of MapReduce's filtering-aggregation programming model. Then, we present a new data processing strategy which performs filtering-join-aggregation tasks in two successive MapReduce jobs. The first job applies filtering logic to all the data sets in parallel, joins the qualified tuples, and pushes the join results to the reducers for partial aggregation. The second job combines all partial aggregation results and produces the final answer. The advantage of our approach is that we join multiple data sets in one go and thus avoid frequent checkpointing and shuffling of intermediate results, a major performance bottleneck in most current MapReduce-based systems. We benchmark our system against Hive, a state-of-the-art MapReduce-based data warehouse, on a 100-node cluster on Amazon EC2 using the TPC-H benchmark. The results show that our approach significantly boosts the performance of complex analysis queries.
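The two-job strategy described in the abstract can be illustrated with a small single-process sketch. This is not the authors' system or API: the function names (`job1_filter_join_partial_aggregate`, `job2_combine`), the table layout, the filter predicates, and the round-robin partitioning are all assumptions made for illustration. The sketch only shows the shape of the computation: filter every data set, join the qualifying tuples in one pass, aggregate partially per partition, then merge the partial results.

```python
from collections import defaultdict

# Hypothetical sample data, loosely shaped like TPC-H tables (assumed for illustration).
CUSTOMERS = [
    {"custkey": 1, "segment": "BUILDING", "nation": "FR"},
    {"custkey": 2, "segment": "AUTO", "nation": "DE"},
    {"custkey": 3, "segment": "BUILDING", "nation": "DE"},
]
ORDERS = [
    {"orderkey": 10, "custkey": 1, "year": 1995},
    {"orderkey": 11, "custkey": 2, "year": 1995},
    {"orderkey": 12, "custkey": 3, "year": 1995},
    {"orderkey": 13, "custkey": 1, "year": 1994},
]
LINEITEMS = [
    {"orderkey": 10, "price": 100},
    {"orderkey": 10, "price": 50},
    {"orderkey": 11, "price": 70},
    {"orderkey": 12, "price": 30},
    {"orderkey": 13, "price": 999},
]


def job1_filter_join_partial_aggregate(customers, orders, lineitems, num_partitions=2):
    """First job: filter each data set, join all three in one pass,
    and keep one partial aggregate (sum of price per nation) per partition."""
    # Filtering step, applied to every input data set.
    cust_by_key = {c["custkey"]: c for c in customers if c["segment"] == "BUILDING"}
    order_by_key = {o["orderkey"]: o for o in orders if o["year"] == 1995}
    items = [l for l in lineitems if l["price"] > 0]

    # One partial-aggregation table per simulated reducer partition.
    partials = [defaultdict(int) for _ in range(num_partitions)]
    for i, item in enumerate(items):
        # Multi-way join: lineitem -> order -> customer, with no intermediate
        # join result materialized between the two lookups.
        order = order_by_key.get(item["orderkey"])
        if order is None:
            continue
        customer = cust_by_key.get(order["custkey"])
        if customer is None:
            continue
        # Round-robin partitioning is an assumption of this sketch.
        partials[i % num_partitions][customer["nation"]] += item["price"]
    return [dict(p) for p in partials]


def job2_combine(partials):
    """Second job: merge all partial aggregates into the final answer."""
    totals = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    partial_results = job1_filter_join_partial_aggregate(CUSTOMERS, ORDERS, LINEITEMS)
    print(job2_combine(partial_results))
```

In a chained pairwise-join plan, the result of each join would typically be written out and reshuffled before the next join starts; the sketch avoids that by resolving both join lookups for a tuple before it reaches the aggregation step, which is the property the abstract identifies as the source of the speedup.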