SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

Abstract

As a widely used parallel computing framework for big data processing, Hadoop MapReduce emphasizes high data throughput over low-latency job execution. However, a growing number of big data applications built on MapReduce require quick response times. As a result, improving the performance of MapReduce jobs, especially short jobs, is of great practical significance and has attracted increasing attention from both academia and industry. Much effort has gone into improving Hadoop performance at the job scheduling or job parameter optimization level. In this paper, we explore an approach that improves the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First, by analyzing the job and task execution mechanism in the MapReduce framework, we reveal two critical limitations on job execution performance. Then we propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time cost of the job's initialization and termination stages; second, instead of using the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism to accelerate performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims to shorten the execution time of MapReduce jobs, especially short jobs. Experimental results show that, compared to standard Hadoop, SHadoop achieves a stable performance improvement of around 25% on average across comprehensive benchmarks without losing scalability or speedup. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop (IDH).
To the best of our knowledge, this work is the first effort to explore optimizing the execution mechanism inside the map/reduce tasks of a job. Its advantage is that it can complement job scheduling optimizations to further improve job execution performance.
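
The contrast the abstract draws between heartbeat-based and instant messaging communication can be illustrated with a toy latency model. The sketch below is not SHadoop's implementation. It only assumes that a heartbeat-only TaskTracker can report a finished task at the next heartbeat tick, while an instant message is sent as soon as the task finishes. The 3-second interval, the 10 ms network delay, the task finish times, and all function names are hypothetical values chosen for illustration.

```python
import math


def heartbeat_report_time(finish_time: float, interval: float) -> float:
    """Time at which a completion is reported when messages travel only on
    heartbeats sent at multiples of `interval` (assumed model)."""
    return math.ceil(finish_time / interval) * interval


def instant_report_time(finish_time: float, network_delay: float) -> float:
    """Time at which a completion is reported when a message is sent
    immediately and takes `network_delay` seconds to arrive (assumed model)."""
    return finish_time + network_delay


def mean_delay(finish_times, report) -> float:
    """Average gap between a task finishing and its completion being reported."""
    return sum(report(t) - t for t in finish_times) / len(finish_times)


HEARTBEAT_INTERVAL = 3.0  # seconds; assumed for illustration
NETWORK_DELAY = 0.01      # seconds; assumed for illustration

# Hypothetical task finish times spread evenly over 30 seconds.
finish_times = [0.25 + 0.5 * i for i in range(60)]

heartbeat_delay = mean_delay(
    finish_times, lambda t: heartbeat_report_time(t, HEARTBEAT_INTERVAL)
)
instant_delay = mean_delay(
    finish_times, lambda t: instant_report_time(t, NETWORK_DELAY)
)

print(f"mean reporting delay, heartbeat only: {heartbeat_delay:.3f} s")
print(f"mean reporting delay, instant message: {instant_delay:.3f} s")
```

Under these assumptions the heartbeat-only delay averages half the heartbeat interval, which is small for long jobs but can be a noticeable share of a short job's running time.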

Publication details

  • Source
    Journal of Parallel and Distributed Computing | 2014, Issue 3 | pp. 2166-2179 | 14 pages
  • Author affiliations

    National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China;

    National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China;

    National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China;

    Intel Asia-Pacific Research and Development Ltd, 880 ZiXing Road, Zizhu Science Park, Shanghai, 200241, China;

    Intel Asia-Pacific Research and Development Ltd, 880 ZiXing Road, Zizhu Science Park, Shanghai, 200241, China;

    National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China;

    National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China;

  • Indexing information
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    Parallel computing; MapReduce; Performance optimization; Distributed processing; Cloud computing;

