MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters

Jiang David; Tung Anthony K. H.; Chen Gang

Abstract

Data analysis is an important functionality in cloud computing, which allows huge amounts of data to be processed over very large clusters. MapReduce is recognized as a popular way to handle data in the cloud environment due to its excellent scalability and good fault tolerance. However, compared to parallel databases, MapReduce performs worse when it is used for complex data analysis tasks that require joining multiple data sets in order to compute certain aggregates. A common concern is whether MapReduce can be improved to produce a system with both scalability and efficiency. In this paper, we introduce Map-Join-Reduce, a system that extends and improves the MapReduce runtime framework to efficiently process complex data analysis tasks on large clusters. We first propose a filtering-join-aggregation programming model, a natural extension of MapReduce's filtering-aggregation programming model. Then, we present a new data processing strategy which performs filtering-join-aggregation tasks in two successive MapReduce jobs. The first job applies filtering logic to all the data sets in parallel, joins the qualified tuples, and pushes the join results to the reducers for partial aggregation. The second job combines all partial aggregation results and produces the final answer. The advantage of our approach is that we join multiple data sets in one go and thus avoid frequent checkpointing and shuffling of intermediate results, a major performance bottleneck in most current MapReduce-based systems. We benchmark our system against Hive, a state-of-the-art MapReduce-based data warehouse, on a 100-node cluster on Amazon EC2 using the TPC-H benchmark. The results show that our approach significantly boosts the performance of complex analysis queries.
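
The two-job strategy described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation or the Map-Join-Reduce API; it is a toy simulation under stated assumptions. The tables (`customers`, `orders`), their fields, the positive-price filter, the modulo partitioning on `custkey`, and all function names are hypothetical and chosen only for illustration. The first function plays the role of one task in the first job (filter, join, partial aggregation); the second plays the role of the second job (combining partial aggregates).

```python
from collections import defaultdict


def filter_join_partial_aggregate(customers, orders, partition_id, num_partitions):
    """Sketch of one task in the first job: filter, join, partially aggregate.

    Assumption: both inputs are partitioned on the join key (custkey), so
    matching tuples always land in the same partition.
    """
    # Filter customers down to this partition and index them by join key.
    custs = {
        c["custkey"]: c
        for c in customers
        if c["custkey"] % num_partitions == partition_id
    }
    partial = defaultdict(float)
    for o in orders:
        if o["custkey"] % num_partitions != partition_id:
            continue  # tuple belongs to another partition
        if o["price"] <= 0:
            continue  # hypothetical filtering predicate
        c = custs.get(o["custkey"])
        if c is None:
            continue  # no join partner, so the tuple is dropped
        # Partial aggregation: sum of order prices per customer nation.
        partial[c["nation"]] += o["price"]
    return dict(partial)


def combine_partials(partials):
    """Sketch of the second job: merge all partial aggregates."""
    final = defaultdict(float)
    for p in partials:
        for key, value in p.items():
            final[key] += value
    return dict(final)


def run_query(customers, orders, num_partitions):
    """Run both phases sequentially over all partitions."""
    partials = [
        filter_join_partial_aggregate(customers, orders, pid, num_partitions)
        for pid in range(num_partitions)
    ]
    return combine_partials(partials)


if __name__ == "__main__":
    customers = [
        {"custkey": 1, "nation": "SG"},
        {"custkey": 2, "nation": "SG"},
        {"custkey": 3, "nation": "CN"},
    ]
    orders = [
        {"custkey": 1, "price": 10.0},
        {"custkey": 2, "price": 5.0},
        {"custkey": 3, "price": 7.0},
        {"custkey": 1, "price": 2.5},
        {"custkey": 4, "price": 100.0},  # no matching customer
        {"custkey": 3, "price": -1.0},   # removed by the filter
    ]
    print(run_query(customers, orders, num_partitions=2))
```

In this sketch the same group key (`"SG"`) appears in the partial results of more than one partition, which is why a second combining step is needed after the join and partial aggregation.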

Bibliographic record

Source: Knowledge and Data Engineering, IEEE Transactions on, 2011, No. 9, pp. 1299-1311 (13 pages)
Authors: Jiang David; Tung Anthony K. H.; Chen Gang
Affiliation: National University of Singapore, Singapore
Original format: PDF
Language: English
Keywords: Cloud computing; parallel systems; query processing

Similar literature

1. SCALE: a scalable framework for efficiently clustering transactional data [J]. Hua Yan, Keke Chen, Ling Liu. Data Mining and Knowledge Discovery, 2010, No. 1
2. An Efficient Visual Analysis Method for Cluster Tendency Evaluation, Data Partitioning and Internal Cluster Validation [J]. Prabhu, Puniethaa, Duraiswamy. Computing and Informatics, 2014, No. 5
3. An Efficient Visual Analysis Method for Cluster Tendency Evaluation, Data Partitioning and Internal Cluster Validation [J]. Puniethaa Prabhu, Karuppusamy Duraiswamy. Computing and Informatics, 2013, No. 5
4. DDoS attack detection approach using an efficient cluster analysis in large data scale [C]. Wesam Bhaya, Mehdi EbadyManaa. 2017 Annual Conference on New Trends in Information & Communications Technology Applications, 2017
5. Efficient Sequence Clustering and Embedding Algorithms for Large-scale Metagenomics Data [D]. Zheng, Wei. 2019
6. Thumbnail Tensor: A Method for Multidimensional Data Streams Clustering with an Efficient Tensor Subspace Model in the Scale-Space [O]. Bogusław Cyganek. 2019
7. SCALE: A Scalable Framework for Efficiently Clustering Transactional Data [O]. Yan, Hua; Chen, Keke; Liu, Ling. 2014