Optimization of checkpoint operations for deep learning computing

Abstract

Systems and methods are provided to optimize checkpoint operations for deep learning (DL) model training tasks. For example, a distributed DL model training process is executed to train a DL model using multiple accelerator devices residing on one or more server nodes, and a checkpoint operation is performed to generate and store a checkpoint of an intermediate DL model. A checkpoint operation includes compressing a checkpoint of an intermediate DL model stored in memory of a given accelerator device to generate a compressed checkpoint, and scheduling a time to perform a memory copy operation to transfer a copy of the compressed checkpoint from the memory of the given accelerator device to a host system memory. The scheduling is performed based on information regarding bandwidth usage of a communication link to be utilized to transfer the compressed checkpoint to perform the memory copy operation, wherein the memory copy operation is performed at the scheduled time.
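
The two steps named in the abstract (compress the checkpoint on the device side, then choose a transfer time from link bandwidth usage) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name a compression algorithm or a scheduling policy, so `zlib` stands in for the compressor, a plain list of forecast utilization values stands in for the bandwidth information, and the "earliest slot under a threshold" rule, along with the names `compress_checkpoint`, `schedule_copy`, and `run_checkpoint`, is hypothetical. Device and host memory are simulated with ordinary Python objects.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class ScheduledCopy:
    slot: int        # index of the time slot chosen for the memory copy
    payload: bytes   # compressed checkpoint that was transferred


def compress_checkpoint(raw: bytes, level: int = 6) -> bytes:
    """Compress the checkpoint bytes (zlib is an assumed stand-in compressor)."""
    return zlib.compress(raw, level)


def schedule_copy(link_usage: Sequence[float], threshold: float = 0.5) -> int:
    """Return the earliest slot whose forecast link utilization is below threshold.

    If no slot qualifies, fall back to the least-busy slot.
    """
    if not link_usage:
        raise ValueError("link_usage must contain at least one slot")
    for slot, usage in enumerate(link_usage):
        if usage < threshold:
            return slot
    return min(range(len(link_usage)), key=lambda s: link_usage[s])


def run_checkpoint(
    raw: bytes,
    link_usage: Sequence[float],
    host_memory: Dict[int, bytes],
    threshold: float = 0.5,
) -> ScheduledCopy:
    """Compress, pick a slot, and record the copy into simulated host memory."""
    payload = compress_checkpoint(raw)
    slot = schedule_copy(link_usage, threshold)
    # A real system would defer the device-to-host copy until `slot`;
    # this sketch only records which slot was chosen and copies immediately.
    host_memory[slot] = payload
    return ScheduledCopy(slot=slot, payload=payload)
```

With a forecast such as `[0.9, 0.8, 0.3, 0.1]` and the default threshold, the sketch picks slot 2, the first slot in which the link is less than half utilized, rather than competing with training traffic in the busier slots before it.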

Bibliographic data

Publication number: US10698766B2

Publication date: 2020.06.30

Original format: PDF

Application number: US15956193
Inventors: Junping Zhao; Dragan Savic

Filing date: 2018.04.18
Country: US
Indexed: 2022-08-21 10:58:41
