IEEE International Parallel and Distributed Processing Symposium Workshops

Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences

Abstract

Deep Learning (DL) models for super-resolution (DLSR) are an emerging trend in response to the growth of ML/DL applications that require high-resolution images. DLSR methods have also shown promise in domains such as medical imaging, surveillance, and microscopy. However, DLSR models are extremely computationally demanding and require unreasonably long training times on modern Volta GPUs. In our experiments, we observed only 10.3 images/second on a single Volta GPU when training EDSR, a state-of-the-art DLSR model for single-image super-resolution. By comparison, a Volta GPU can process 360 images/second when training ResNet-50, a state-of-the-art model for image classification. We therefore believe supercomputers are a good candidate for speeding up DLSR model training. In this paper, we select EDSR as the representative DLSR PyTorch model and introduce Horovod-based distributed EDSR training. However, we observed poor scaling performance for default EDSR training on the Lassen HPC system at Lawrence Livermore National Laboratory. To investigate this performance degradation, we perform exhaustive communication profiling. The profiling insights are then used to optimize CUDA-Aware MPI for DLSR models by ensuring that advanced MPI designs involving CUDA IPC and registration caching are properly applied by DL frameworks. We present a comprehensive scaling study of EDSR with MVAPICH2-GDR and NCCL on up to 512 GPUs on Lassen, and demonstrate a 15.6% improvement in scaling efficiency over default Horovod training, which translates to a 1.26× speedup in training performance.
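The Horovod-based distributed training described above follows Horovod's standard PyTorch pattern: each process drives one GPU, a regular optimizer is wrapped in an allreduce-based DistributedOptimizer, and the initial model state is broadcast from rank 0. The sketch below illustrates that pattern only; EDSRModel, train_loader, and the hyperparameters are hypothetical placeholders standing in for the actual EDSR training setup, not code from the paper.

    import torch
    import horovod.torch as hvd

    # Initialize Horovod and pin each process to its local GPU.
    hvd.init()
    torch.cuda.set_device(hvd.local_rank())

    # EDSRModel is a placeholder for the actual EDSR network definition.
    model = EDSRModel().cuda()

    # Scale the learning rate by the worker count (common Horovod practice).
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4 * hvd.size())

    # Wrap the optimizer so gradients are averaged across all GPUs via
    # allreduce (NCCL or CUDA-Aware MPI, depending on how Horovod was built).
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Start all workers from identical model and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    for lr_batch, hr_batch in train_loader:  # placeholder DataLoader
        optimizer.zero_grad()
        sr_batch = model(lr_batch.cuda())
        # EDSR trains with an L1 loss between super-resolved and HR images.
        loss = torch.nn.functional.l1_loss(sr_batch, hr_batch.cuda())
        loss.backward()
        optimizer.step()

Whether the allreduce runs over NCCL or CUDA-Aware MPI is fixed when Horovod is built (e.g. with HOROVOD_GPU_OPERATIONS=NCCL) and by the launcher, e.g. horovodrun -np 512 python train_edsr.py or an mpirun invocation against MVAPICH2-GDR (which typically requires runtime flags such as MV2_USE_CUDA=1); the specific MVAPICH2-GDR tuning used in the paper's optimized runs is not reproduced here.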