Journal: Concurrency and Computation: Practice and Experience

Parameterizable benchmarking framework for designing a MapReduce performance model†


Abstract

In MapReduce environments, many applications must meet different performance goals to produce time-relevant results. One typical user question is how to estimate the completion time of a MapReduce program as a function of varying input dataset sizes and given cluster resources. In this work, we offer a novel performance evaluation framework for answering this question. We analyze the MapReduce processing pipeline and utilize the fact that the execution of map (reduce) tasks consists of specific, well-defined data processing phases. Only the map and reduce functions are custom, and their executions are user-defined for different MapReduce jobs. The executions of the remaining phases are generic (i.e., defined by the MapReduce framework code) and depend on the amount of data processed by the phase and the performance of the underlying Hadoop cluster. First, we design a set of parameterizable microbenchmarks to profile the execution of generic phases and to derive a platform performance model of a given Hadoop cluster. Then, using the job's past executions, we summarize the job's properties and the performance of its custom map/reduce functions in a compact job profile. Finally, by combining the knowledge of the job profile and the derived platform performance model, we introduce a MapReduce performance model that estimates the program completion time for processing a new dataset. The proposed benchmarking approach derives an accurate performance model of Hadoop's generic execution phases once, and this model is then reused for predicting the performance of different applications. The evaluation study justifies our approach and the proposed framework: we use a diverse suite of 12 MapReduce applications to validate the proposed model. The predicted completion times for most experiments are within 10% of the measured ones (with a worst-case error of 17%) on our 66-node Hadoop cluster. Copyright © 2014 John Wiley & Sons, Ltd.
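
The three steps in the abstract (platform model, job profile, combined estimate) can be illustrated with a minimal Python sketch. The abstract does not give the model's equations, so everything concrete here is an assumption made for illustration: the phase names (`read`, `spill`, `shuffle`, `write`), the linear cost model per generic phase, the fields of `JobProfile`, and the simple wave-based completion-time formula. All numbers are made up and are not measurements from the paper.

```python
from dataclasses import dataclass
from math import ceil


def fit_line(xs, ys):
    """Ordinary least squares for duration = a + b * data_size."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


@dataclass
class JobProfile:
    """Hypothetical compact profile extracted from a job's past runs."""
    map_fn_sec_per_mb: float     # cost of the custom map function
    reduce_fn_sec_per_mb: float  # cost of the custom reduce function
    map_selectivity: float       # map output size / map input size
    reduce_selectivity: float    # reduce output size / reduce input size


def phase_time(platform, phase, mb):
    """Duration of a generic phase, from the fitted platform model."""
    a, b = platform[phase]
    return a + b * mb


def estimate_completion(platform, profile, input_mb, split_mb,
                        n_reduces, map_slots, reduce_slots):
    """Estimate job completion time in seconds (assumed wave-based model)."""
    n_maps = ceil(input_mb / split_mb)
    map_out_mb = split_mb * profile.map_selectivity
    map_task = (phase_time(platform, "read", split_mb)
                + profile.map_fn_sec_per_mb * split_mb
                + phase_time(platform, "spill", map_out_mb))

    reduce_in_mb = input_mb * profile.map_selectivity / n_reduces
    reduce_out_mb = reduce_in_mb * profile.reduce_selectivity
    reduce_task = (phase_time(platform, "shuffle", reduce_in_mb)
                   + profile.reduce_fn_sec_per_mb * reduce_in_mb
                   + phase_time(platform, "write", reduce_out_mb))

    # Tasks run in waves: each wave fills all available slots.
    map_stage = ceil(n_maps / map_slots) * map_task
    reduce_stage = ceil(n_reduces / reduce_slots) * reduce_task
    return map_stage + reduce_stage


# Step 1: fit the platform model once from microbenchmark samples
# (illustrative values: data size in MB, duration in seconds).
sizes = [64, 128, 256]
platform = {
    "read": fit_line(sizes, [0.9, 1.6, 3.1]),
    "spill": fit_line(sizes, [1.5, 2.8, 5.4]),
    "shuffle": fit_line(sizes, [2.1, 4.0, 7.9]),
    "write": fit_line(sizes, [1.0, 1.9, 3.7]),
}

# Step 2: a job profile (illustrative values).
profile = JobProfile(0.02, 0.01, 0.5, 1.0)

# Step 3: reuse the same platform model to predict a new dataset size.
print(estimate_completion(platform, profile, input_mb=10_000, split_mb=128,
                          n_reduces=20, map_slots=40, reduce_slots=20))
```

The point of the split is that `platform` is fitted once per cluster, while only `profile` changes from one application to the next, which mirrors the reuse described in the abstract.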