Conference: Parallel and Distributed Computing, Applications and Technologies, 2009

CheCUDA: A Checkpoint/Restart Tool for CUDA Applications

Abstract

In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. Because existing checkpoint/restart implementations do not support checkpointing GPU state, CheCUDA hooks a subset of the basic CUDA driver API calls to record state changes in main memory. At checkpoint time, CheCUDA copies all necessary data from video memory to main memory, disables the CUDA runtime, and then stores the recorded state changes in a file. At restart, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on the GPUs so that execution resumes from the stored state. This paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with the basic APIs. It also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only for enhancing the dependability of CUDA applications but also for enabling the dynamic task scheduling of CUDA applications that is required especially on heterogeneous GPU cluster systems. This paper also reports the timing overhead of checkpointing.
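
The record-and-restore idea in the abstract can be illustrated with a small toy model. The sketch below is not CheCUDA's code and does not call CUDA at all: `FakeGPU`, `Recorder`, and every method name are hypothetical, and the "device" is just a Python dictionary. The virtual-handle indirection is also an assumption made here so that application-visible handles stay valid after restart; the abstract does not say how CheCUDA handles this.

```python
import json
import os
import tempfile


class FakeGPU:
    """Hypothetical stand-in for a GPU context: device handle -> bytearray."""

    def __init__(self):
        self.memory = {}
        self._next_handle = 0x1000

    def alloc(self, size):
        handle = self._next_handle
        self._next_handle += 0x1000
        self.memory[handle] = bytearray(size)
        return handle

    def free(self, handle):
        del self.memory[handle]


class Recorder:
    """Wraps allocation calls and tracks live allocations in host memory."""

    def __init__(self):
        self.gpu = FakeGPU()
        self.live = {}  # virtual handle -> (device handle, size)
        self._next_virtual = 1

    def mem_alloc(self, size):
        dev = self.gpu.alloc(size)
        virt = self._next_virtual
        self._next_virtual += 1
        self.live[virt] = (dev, size)  # record the state change
        return virt

    def mem_free(self, virt):
        dev, _ = self.live.pop(virt)  # record the state change
        self.gpu.free(dev)

    def copy_to_device(self, virt, data):
        dev, size = self.live[virt]
        if len(data) > size:
            raise ValueError("data larger than allocation")
        self.gpu.memory[dev][:len(data)] = data

    def copy_to_host(self, virt):
        dev, _ = self.live[virt]
        return bytes(self.gpu.memory[dev])

    def checkpoint(self, path):
        # 1. Copy every live device buffer back to host memory.
        buffers = {str(v): self.copy_to_host(v).hex() for v in self.live}
        # 2. Tear down the (fake) device context.
        self.gpu = None
        # 3. Write the recorded state to a file.
        with open(path, "w") as f:
            json.dump({"next_virtual": self._next_virtual,
                       "buffers": buffers}, f)

    @classmethod
    def restart(cls, path):
        with open(path) as f:
            saved = json.load(f)
        rec = cls()  # re-initialize a fresh (fake) device context
        rec._next_virtual = saved["next_virtual"]
        for virt, hexdata in saved["buffers"].items():
            data = bytes.fromhex(hexdata)
            dev = rec.gpu.alloc(len(data))  # re-create the resource
            rec.gpu.memory[dev][:] = data   # re-upload its contents
            rec.live[int(virt)] = (dev, len(data))
        return rec


def demo():
    rec = Recorder()
    kept = rec.mem_alloc(4)
    scratch = rec.mem_alloc(8)
    rec.copy_to_device(kept, b"\x01\x02\x03\x04")
    rec.mem_free(scratch)  # freed buffers must not be restored
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        rec.checkpoint(path)
        restored = Recorder.restart(path)
        return kept, restored.copy_to_host(kept), sorted(restored.live)
    finally:
        os.remove(path)
```

In a real tool the wrapper methods would correspond to intercepted driver calls, and the saved file would sit alongside a conventional process checkpoint; the toy model only shows why tracking live allocations on the host is enough to rebuild device state after the context is destroyed.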
