
CloudFlow: A data-aware programming model for cloud workflow applications on modern HPC systems



Abstract

Traditional High-Performance Computing (HPC) based big-data applications are usually constrained by having to move large amounts of data to compute facilities for real-time processing. Modern HPC systems, represented by High-Throughput Computing (HTC) and Many-Task Computing (MTC) platforms, instead aim to achieve the long-held goal of moving compute to data. This kind of data-aware scheduling, typified by Hadoop MapReduce, has been successfully implemented in the Map phase, whereby each Map task is dispatched to the compute node where its input data chunk is located. However, Hadoop MapReduce limits itself to a one-Map-to-one-Reduce framework, which makes it difficult to express complex logic such as pipelines or workflows. It also lacks built-in support and optimization for input datasets that are shared among multiple applications and/or jobs; performance can improve significantly when knowledge of shared, frequently accessed data is factored into scheduling decisions. To enhance workflow management on modern HPC systems, this paper presents CloudFlow, a Hadoop MapReduce based programming model for cloud workflow applications. CloudFlow is built on top of MapReduce and is not only data-aware but also shared-data-aware: it identifies the most frequently shared data at both the task level and the job level, and replicates it to each compute node for data locality. It also supports multiple user-defined Map and Reduce functions, allowing users to orchestrate the required data-flow logic. We prove the correctness of the whole scheduling framework through theoretical analysis. Furthermore, experimental evaluation shows that the execution runtime speedup exceeds 4X compared to a traditional MapReduce implementation, with a manageable time overhead.
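The one-Map-to-one-Reduce limitation above is concrete: with stock Hadoop, a workflow must be expressed as separately submitted jobs whose intermediate results round-trip through HDFS. The paper's CloudFlow API is not reproduced here; the following minimal Java sketch, using only the standard org.apache.hadoop.mapreduce API (the TwoStagePipeline class, its identity-placeholder stages, and the path arguments are illustrative assumptions), shows the kind of two-stage pipeline the abstract refers to, and the intermediate data movement that CloudFlow's shared-data-aware scheduling targets.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStagePipeline {

    // One stage of the pipeline. The identity Mapper/Reducer base classes
    // stand in for the user-defined Map and Reduce functions that a real
    // workflow (or a CloudFlow application) would supply.
    static Job makeStage(Configuration conf, String name, Path in, Path out)
            throws Exception {
        Job job = Job.getInstance(conf, name);
        job.setJarByClass(TwoStagePipeline.class);
        job.setMapperClass(Mapper.class);    // placeholder: identity map
        job.setReducerClass(Reducer.class);  // placeholder: identity reduce
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // stage-1 output = stage-2 input
        Path output = new Path(args[2]);

        // Stage 2 cannot start until stage 1 finishes, and the intermediate
        // dataset is written to HDFS and re-read from wherever its blocks
        // land -- the shared-data traffic that CloudFlow's replication-based,
        // shared-data-aware scheduling is designed to reduce.
        if (!makeStage(conf, "stage-1", input, intermediate).waitForCompletion(true)) {
            System.exit(1);
        }
        System.exit(makeStage(conf, "stage-2", intermediate, output)
                .waitForCompletion(true) ? 0 : 1);
    }
}

A chain like this would be launched with the usual hadoop jar invocation, e.g. hadoop jar pipeline.jar TwoStagePipeline <input> <intermediate> <output>; in CloudFlow the same data-flow logic is declared within one programming model rather than stitched together job by job.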

Bibliographic Details

  • Source
    Future Generation Computer Systems | 2015, Issue 10 | pp. 98-110 | 13 pages
  • Author Affiliations

    Kavli Institute for Astrophysics and Space Research, Massachusetts Institute of Technology, Cambridge, MA 02139, USA;

    KINDI Center for Computing Research, Qatar University, Doha, Qatar;

    KINDI Center for Computing Research, Qatar University, Doha, Qatar;

    Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND 58108-6050, USA;

    Department of Computer Science, State University of New York, New Paltz, NY 12561, USA;

    School of Information Technologies, The University of Sydney, Sydney, NSW 2006, Australia;

  • Indexed In: Science Citation Index (SCI); Engineering Index (EI)
  • Original Format: PDF
  • Language: English (eng)
  • CLC Classification:
  • Keywords

    Concurrency; Data aware; MapReduce; HPC; Programming model

