
Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications


Abstract

Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models forces programmers to resort to multiple code versions, complex data-copy steps, and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system, and the associated architecture support, which enables a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems, and has been adopted into NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that supports key HPE features provides a rare opportunity to study the effectiveness of the hardware support by running important benchmarks on a real runtime and real hardware. Experimental results show that in an exemplar heterogeneous system, peer DMA, double buffering, pinned buffers, and related software techniques can improve the inter-accelerator data communication bandwidth by 2×. They can also improve the execution speed by 1.6× for a 3D finite-difference computation, 2.5× for a 1D FFT, and 1.6× for merge sort, all measured on real hardware. The proposed architecture support enables the HPE runtime to transparently deploy these optimizations under simple, portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.
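For context on the hardware features the abstract refers to, the following is a minimal CUDA sketch, not the paper's HPE interface, of the two inter-GPU transfer paths that CUDA 4.0 and later expose: direct peer DMA when the devices can reach each other over the interconnect, and staging through a pinned (page-locked) host buffer otherwise. The device indices and buffer size are illustrative assumptions, and error checking is omitted for brevity.

// Sketch: move a buffer from GPU 0 to GPU 1, using peer DMA if available,
// otherwise staging through a pinned host buffer. Not the HPE API.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;        // 64 MiB payload (assumed size)
    float *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);              // source buffer on GPU 0
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);              // destination buffer on GPU 1

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);   // can GPU 1 reach GPU 0's memory?

    if (canAccess) {
        // Peer DMA path: the copy engines move data directly over the
        // interconnect without touching host memory.
        cudaDeviceEnablePeerAccess(0, 0);        // current device is 1
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
    } else {
        // Fallback path: stage through a pinned host buffer so both halves
        // of the transfer can be done by DMA rather than pageable copies.
        float *staging = nullptr;
        cudaMallocHost(&staging, bytes);
        cudaMemcpy(staging, src, bytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(dst, staging, bytes, cudaMemcpyHostToDevice);
        cudaFreeHost(staging);
    }

    cudaDeviceSynchronize();
    printf("transfer done (peer access: %d)\n", canAccess);

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}

The abstract's double-buffering optimization would split such a transfer into chunks and overlap the two memcpy stages on separate streams; the HPE runtime's contribution is deploying these choices transparently beneath a single portable interface.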
