Performance Tuning of MapReduce Programs.

Abstract

This dissertation addresses performance tuning of MapReduce programs. The MapReduce framework simplifies processing of large datasets across a large number of machines because a user only needs to implement map and reduce functions to create a scalable distributed application. The framework takes care of all other operations, such as creating tasks for each function, parallelizing the tasks, distributing data, and handling machine failures. MapReduce programs run in both Hadoop and YARN. Hadoop is a computing framework built on the design of the original MapReduce framework, and YARN is a generalized, container-oriented, large-scale data processing framework that runs MapReduce applications.

In this dissertation, we first characterize MapReduce programs based on the CPU and IO usage of a map task. Our findings show that, based on the similarity of application performance under different task parallelism settings, MapReduce applications can be grouped into three categories (IO-intensive, Balanced, and CPU-intensive) using cutoffs for CPU usage. Applications belonging to each group exhibit similar map completion time characteristics.

Second, we develop a static tuning method for setting task parallelism for MapReduce programs. We evaluate thirteen MapReduce applications from all three application categories. Using two clusters with different architectures, we obtain the same finding on both: IO-intensive applications perform best with a normalized map task parallelism below 1, Balanced applications with 1 or above, and CPU-intensive applications with a value close to 1. Normalized map task parallelism is the ratio of the number of map tasks to the number of CPU contexts present in a system. This static method of choosing task parallelism values based on the category of an application is more efficient than exhaustively profiling or using a default setting.

Third, we develop a feedback-controller-based dynamic tuning approach to adjust task parallelism during runtime execution of MapReduce applications. For this, we measure map completion time against metric values and identify three instantaneously measurable operating system metrics (user CPU, blocked processes, and context switch value) as indicators of application performance during runtime execution. Using these metrics and a combined value called score, we develop a PID controller for Hadoop, and Waterlevel, PD, and PD+pruning controllers for YARN. Our findings show that dynamically changing task parallelism using feedback controllers achieves performance close to that of the optimal task parallelism values and better than that of default and best-practice settings, with the added benefit of not requiring application profiling.

Fourth, we study the performance effects of data scaling and configuration parameters on MapReduce programs running in a large cluster of 540 nodes. We find that IO-intensive applications do not scale when data size increases. Configuration parameters that change task parallelism and overlap affect application performance. This study also uncovers issues that occur at large scale, such as the production of huge logs, the need to change the allocation strategy for tasks that coordinate application execution, and the need to use data types that can handle calculations involving large numbers without causing overflows.

Fifth, we develop a log compression technique that compresses log messages online during the execution of a MapReduce application. It does so by encoding each log message with the identifier of its log message template, where the templates are derived from Hadoop/YARN's source code. Our findings show that this technique reduces the log size to one-third of the raw uncompressed log size with 3% overhead on application completion time.
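
The abstract defines normalized map task parallelism as the ratio of the number of map tasks to the number of CPU contexts in a system. The following Python sketch only illustrates that ratio and its inverse; it is not taken from the dissertation. The function names, the 16-context machine, and the per-category target values are hypothetical placeholders, since the abstract states only "below 1", "1 or above", and "close to 1".

```python
def normalized_parallelism(map_tasks: int, cpu_contexts: int) -> float:
    """Ratio of the number of map tasks to the number of CPU contexts."""
    if cpu_contexts <= 0:
        raise ValueError("cpu_contexts must be positive")
    return map_tasks / cpu_contexts


def map_tasks_for(target_ratio: float, cpu_contexts: int) -> int:
    """Map task count whose normalized parallelism is nearest the target.

    At least one map task is always returned.
    """
    if target_ratio <= 0 or cpu_contexts <= 0:
        raise ValueError("target_ratio and cpu_contexts must be positive")
    return max(1, round(target_ratio * cpu_contexts))


# Hypothetical targets chosen only to satisfy the abstract's ranges;
# the dissertation's actual per-category values are not given here.
EXAMPLE_TARGETS = {
    "IO-intensive": 0.5,    # below 1
    "Balanced": 1.25,       # 1 or above
    "CPU-intensive": 1.0,   # close to 1
}


def static_setting(category: str, cpu_contexts: int) -> int:
    """Pick a map task count from the application category alone."""
    return map_tasks_for(EXAMPLE_TARGETS[category], cpu_contexts)
```

For example, with these placeholder targets on an assumed 16-context machine, `static_setting("IO-intensive", 16)` gives 8 map tasks, which corresponds to a normalized parallelism of 0.5.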

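Bibliographic record

  • Author: K.C., Kamal
  • Author affiliation: North Carolina State University
  • Degree-granting institution: North Carolina State University
  • Discipline: Computer science
  • Degree: Ph.D.
  • Year: 2015
  • Pages: 110 p.
  • Total pages: 110
  • Original format: PDF
  • Language of text: English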