Journal: Journal of Supercomputing

Job migration in HPC clusters by means of checkpoint/restart


Abstract

Until now, jobs running on HPC clusters have been tied to the node where their execution started. We have removed that limitation by integrating a user-level checkpoint/restart library into a resource manager, in a way that is fully transparent to both the user and the running application. This opens the door to a whole new set of tools and scheduling possibilities, because jobs can be checkpointed, migrated, and restarted in a different place or at a different time, while fault tolerance is provided for every job running on the cluster. This is of utmost importance for the coming generation of exascale HPC clusters, where the increasing complexity of efficient scheduling makes it challenging to obtain the degree of parallelism that applications demand.
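
The checkpoint, migrate, and restart cycle described in the abstract can be illustrated with a toy sketch. This is not the authors' method: the abstract describes a transparent user-level library integrated into a resource manager, whereas the code below saves state explicitly at the application level. The function names (`run`, `checkpoint`, `restart`, `migrate_demo`), the JSON state layout, and the use of a temporary directory as a stand-in for storage shared between two nodes are all assumptions made for illustration.

```python
import json
import os
import tempfile


def run(state, max_steps=None):
    """Advance a toy job (summing 0..total-1); return True once it has finished."""
    steps = 0
    while state["next_i"] < state["total"]:
        if max_steps is not None and steps >= max_steps:
            return False  # interrupted before completion, e.g. to be migrated
        state["acc"] += state["next_i"]
        state["next_i"] += 1
        steps += 1
    return True


def checkpoint(state, path):
    """Write the state to a temporary file, then rename it into place atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def restart(path):
    """Load a previously checkpointed state."""
    with open(path) as f:
        return json.load(f)


def migrate_demo(total=1000, steps_before_migration=400):
    """Run part of the job, checkpoint it, then resume from the checkpoint."""
    # Hypothetical stand-in for a filesystem visible to both nodes.
    with tempfile.TemporaryDirectory() as shared:
        path = os.path.join(shared, "job.ckpt")

        # "Node A": start the job and stop it partway through.
        state = {"total": total, "next_i": 0, "acc": 0}
        run(state, max_steps=steps_before_migration)
        checkpoint(state, path)
        del state  # nothing survives on node A except the checkpoint file

        # "Node B": restore the state and run to completion.
        resumed = restart(path)
        run(resumed)
        return resumed["acc"]


if __name__ == "__main__":
    print(migrate_demo())
```

The resumed job produces the same result as an uninterrupted run, which is the property that makes migration safe. A transparent library would capture the process state without the application cooperating as it does here.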
