Towards Cross-Platform Performance Portability of DNN Models using SYCL

Abstract

The upcoming deployment of Exascale platforms, with a myriad of different architectures and co-processors, has prompted the need for a software ecosystem based on open standards that simplifies maintaining HPC applications across different hardware. Applications written for one platform should be portable to another while keeping performance as close to peak as possible. However, key performance routines in relevant HPC applications are not expected to be performance portable as-is, especially common building blocks such as BLAS or DNN routines. The oneAPI initiative aims to tackle this problem by combining a programming model, SYCL, with a set of interfaces for common building blocks that can be optimized by different hardware vendors. In particular, oneAPI includes the oneDNN performance library, which contains building blocks for deep learning applications and frameworks. Because oneDNN uses the SYCL programming model, it integrates easily with existing SYCL and C++ applications, sharing data with the rest of the application and executing collaboratively on the same devices. In this paper, we introduce a cuDNN backend for oneDNN, which allows oneAPI applications to run on NVIDIA hardware by taking advantage of existing building blocks from the CUDA ecosystem. We implement relevant neural networks (ResNet-50 and VGG-16) both in native CUDA and in oneAPI with the CUDA backend, and demonstrate that performance portability can be achieved by leveraging existing building blocks for the target hardware.