SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

Rong Gu; Xiaoliang Yang; Jinshuang Yan; Yuanhao Sun; Bing Wang; Chunfeng Yuan; Yihua Huang


Abstract

Hadoop MapReduce, a widely used parallel computing framework for big data processing, emphasizes high data throughput over low job-execution latency. However, a growing number of big data applications built on MapReduce require quick response times. Improving the performance of MapReduce jobs, especially short jobs, is therefore of great practical significance and has attracted increasing attention from both academia and industry. Much effort has gone into improving Hadoop performance at the level of job scheduling or job parameter optimization. In this paper, we explore an approach that improves the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First, by analyzing the job and task execution mechanism in the MapReduce framework, we reveal two critical limitations to job execution performance. We then propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time cost of the job's initialization and termination stages; second, instead of using the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism to accelerate performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims to shorten the execution time of MapReduce jobs, especially short jobs. Experimental results show that, compared with standard Hadoop, SHadoop achieves a stable performance improvement of around 25% on average on comprehensive benchmarks without losing scalability or speedup. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop (IDH).
To the best of our knowledge, this work is the first effort to explore optimizing the execution mechanism inside the map/reduce tasks of a job. Its advantage is that it can complement job scheduling optimizations to further improve job execution performance.
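
The effect of heartbeat-based reporting on a short job can be illustrated with a toy model. The sketch below is not SHadoop's implementation. It assumes a job whose phases (setup, map, reduce, cleanup) run back to back, and that the next phase is scheduled only after the previous completion has been reported. The phase durations, the 3-second heartbeat interval, the 10 ms instant-message latency, and the names `heartbeat_delay` and `job_time` are all hypothetical values chosen for illustration.

```python
from typing import Callable, Sequence


def heartbeat_delay(event_time: float, interval: float) -> float:
    """Wait from event_time until the next heartbeat tick strictly after it."""
    ticks_elapsed = event_time // interval
    return (ticks_elapsed + 1) * interval - event_time


def job_time(durations: Sequence[float],
             report_delay: Callable[[float], float]) -> float:
    """Total time for phases run back to back, where each completion must be
    reported (taking report_delay(now) seconds) before the next phase starts."""
    clock = 0.0
    for duration in durations:
        clock += duration             # the phase itself runs
        clock += report_delay(clock)  # wait until its completion is reported
    return clock


# Hypothetical values, in seconds (assumptions, not measurements from the paper).
PHASES = [1.0, 4.0, 2.0, 1.0]   # setup, map, reduce, cleanup
HEARTBEAT_INTERVAL = 3.0        # assumed heartbeat period
INSTANT_LATENCY = 0.01          # assumed cost of sending one message immediately

if __name__ == "__main__":
    with_heartbeat = job_time(
        PHASES, lambda now: heartbeat_delay(now, HEARTBEAT_INTERVAL))
    with_instant = job_time(PHASES, lambda now: INSTANT_LATENCY)
    print(f"heartbeat reporting: {with_heartbeat:.2f} s")
    print(f"instant reporting:   {with_instant:.2f} s")
```

In this model the waiting time added per phase is bounded by the heartbeat interval regardless of how long the phase runs, so the relative overhead is largest for short jobs, which matches the abstract's emphasis on short jobs. The model ignores parallel tasks, slot contention, and scheduling policy.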

Bibliographic details

Source
Journal of Parallel and Distributed Computing | 2014, Issue 3 | pp. 2166-2179 | 14 pages
Authors
Rong Gu; Xiaoliang Yang; Jinshuang Yan; Yuanhao Sun; Bing Wang; Chunfeng Yuan; Yihua Huang
Author affiliations

National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China;

National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China;

National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China;

Intel Asia-Pacific Research and Development Ltd, 880 ZiXing Road, Zizhu Science Park, Shanghai, 200241, China;

Intel Asia-Pacific Research and Development Ltd, 880 ZiXing Road, Zizhu Science Park, Shanghai, 200241, China;

National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China;

National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China;

Original format: PDF
Language: English
Keywords
Parallel computing; MapReduce; Performance optimization; Distributed processing; Cloud computing;

Similar literature

1. Optimized Speculative Execution to Improve Performance of MapReduce Jobs on Virtualized Computing Environment [J]. Yang Lei, Dai Yu, Zhang Bin. Mathematical Problems in Engineering, 2017, Issue PTa11.
2. Performance optimization of MapReduce-based Apriori algorithm on Hadoop cluster [J]. Singh Sudhakar, Garg Rakhi, Mishra P. K. Computers and Electrical Engineering, 2018.
3. Job-Aware File-Storage Optimization for Improved Hadoop I/O Performance [J]. Makoto NAKAGAMI, Jose A.B. FORTES, Saneyasu YAMAGUCHI. IEICE Transactions on Information and Systems, 2020, Issue 10.
4. Performance Optimization for Short MapReduce Job Execution in Hadoop [C]. Yan Jinshuang, Yang Xiaoliang, Gu Rong. The Second International Conference on Cloud and Green Computing, 2012.
5. Improving Hadoop performance by using metadata of related jobs in text datasets via enhancing MapReduce workflow [D]. Alshammari, Hamoud. 2016.
6. Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce [O]. Ablimit Aji, Fusheng Wang, Hoang Vo.
7. Optimized Speculative Execution to Improve Performance of MapReduce Jobs on Virtualized Computing Environment [O]. Lei Yang, Yu Dai, Bin Zhang. 2017.
