Conference: International Conference on Parallel and Distributed Computing

Slurm-V: Extending Slurm for Building Efficient HPC Cloud with SR-IOV and IVShmem

Abstract

To alleviate the cost burden, efficiently sharing HPC cluster resources with end users through virtualization is becoming increasingly attractive. In this context, some critical HPC resources shared among Virtual Machines (VMs), such as Single Root I/O Virtualization (SR-IOV) enabled Virtual Functions (VFs) and Inter-VM Shared Memory (IVShmem) devices, need to be enabled and isolated to support efficiently running multiple concurrent MPI jobs on HPC clouds. However, the original Slurm is not able to supervise VMs and their associated critical resources, such as VFs and IVShmem devices. This paper proposes a novel framework, Slurm-V, which extends Slurm with virtualization-oriented capabilities such as job submission to dynamically created VMs with isolated SR-IOV and IVShmem resources. We propose three alternative designs for Slurm-V to manage and isolate IVShmem and SR-IOV resources for running MPI jobs: a task-based design, a SPANK plugin-based design, and a SPANK plugin over OpenStack-based design. We evaluate these designs in terms of startup performance, scalability, and application performance in different scenarios. The evaluation results show that VM startup time can be reduced by up to 2.64X through the snapshot scheme in the Slurm SPANK plugin. Our proposed Slurm-V framework shows good scalability and the ability to efficiently run concurrent MPI jobs on SR-IOV enabled InfiniBand clusters. To the best of our knowledge, Slurm-V is the first attempt to extend Slurm to support running concurrent MPI jobs with isolated SR-IOV and IVShmem resources. The capabilities of Slurm-V can be used to build efficient HPC clouds.