IEEE International Parallel and Distributed Processing Symposium

Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-Enabled Systems

Abstract

GPU accelerators are widely used in HPC clusters due to their massive parallelism and high throughput-per-watt. Data movement continues to be the major bottleneck on GPU clusters, even more so when the data is non-contiguous, as is common in scientific applications. CUDA-Aware MPI libraries optimize non-contiguous data movement with latency-oriented techniques, such as GPU kernels that accelerate the packing/unpacking operations. Although these techniques optimize the latency of a single operation, the inherent restrictions of the designs limit their efficiency for throughput-oriented patterns. Indeed, none of the existing designs fully exploits the massive parallelism of GPUs to provide high throughput and efficient resource utilization by enabling maximal overlap. In this paper, we propose novel designs for CUDA-Aware MPI libraries that achieve efficient GPU resource utilization and maximal overlap between CPUs and GPUs for non-contiguous data processing and movement. The proposed designs take advantage of several CUDA features, such as Hyper-Q/multi-streams and callback functions, to deliver high performance and efficiency. To the best of our knowledge, this is the first study to provide high throughput and efficient resource utilization for non-contiguous MPI data processing and movement to/from GPUs. A performance evaluation of the proposed designs using DDTBench shows up to 54%, 67%, and 61% performance improvement on the SPECFEM3D_oc, SPECFEM3D_cm, and WRF_y_sa benchmarks, respectively, for intra-node inter-GPU ping-pong experiments. The proposed designs also deliver up to 33% improvement in total execution time over existing designs for a HaloExchange-based application kernel that models the communication pattern of the MeteoSwiss weather forecasting model, run over 32 GPU nodes on the Wilkes GPU cluster.
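The overlap technique the abstract describes can be illustrated with a small sketch: non-contiguous data is packed chunk by chunk with GPU kernels issued on separate CUDA streams (run concurrently via Hyper-Q), and a stream callback notifies the CPU as each chunk is staged so communication can start while later chunks are still being packed. This is a minimal, hypothetical illustration, not the paper's implementation (the actual designs live inside the MPI library, and with GPUDirect RDMA the host staging copy shown here may be unnecessary); names such as pack_column and chunk_ready are invented for this example.

```cuda
// Sketch of multi-stream packing with callback-driven CPU/GPU overlap.
// Illustrative only; error checking omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

#define NSTREAMS 4

// Pack stride-separated elements (e.g., a column of a row-major 2-D
// field, the classic MPI vector datatype) into a contiguous buffer.
__global__ void pack_column(const double *src, double *dst,
                            int count, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        dst[i] = src[(size_t)i * stride];
}

// Fires on the CPU once this stream's pack kernel and copy are done;
// inside an MPI library this is where the send for the chunk would be
// posted, overlapping communication with the packing of later chunks.
void CUDART_CB chunk_ready(cudaStream_t, cudaError_t, void *arg)
{
    printf("chunk %d staged; CPU can post its send now\n",
           (int)(size_t)arg);
}

int main(void)
{
    const int count = 1 << 18, stride = 64;
    const int per_chunk = count / NSTREAMS;

    double *d_src, *d_stage, *h_stage;
    cudaMalloc(&d_src, (size_t)count * stride * sizeof(double));
    cudaMalloc(&d_stage, (size_t)count * sizeof(double));
    cudaMallocHost(&h_stage, (size_t)count * sizeof(double)); // pinned

    cudaStream_t s[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&s[i]);

    // Pipeline the chunks: each stream packs, copies, then signals,
    // so pack kernels, D2H copies, and CPU-side sends all overlap.
    for (int c = 0; c < NSTREAMS; ++c) {
        size_t off = (size_t)c * per_chunk;
        pack_column<<<(per_chunk + 255) / 256, 256, 0, s[c]>>>(
            d_src + off * stride, d_stage + off, per_chunk, stride);
        cudaMemcpyAsync(h_stage + off, d_stage + off,
                        per_chunk * sizeof(double),
                        cudaMemcpyDeviceToHost, s[c]);
        cudaStreamAddCallback(s[c], chunk_ready,
                              (void *)(size_t)c, 0);
    }

    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d_src); cudaFree(d_stage); cudaFreeHost(h_stage);
    return 0;
}
```

A latency-oriented design would instead pack the whole message with one kernel on one stream and block until it completes; the chunked, multi-stream pipeline above is what allows throughput-oriented patterns to keep both the GPU and the interconnect busy at once.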
