Venue: IEEE International Conference on Cluster Computing

GGAS: Global GPU address spaces for efficient communication in heterogeneous clusters

Abstract

Modern GPUs are powerful high-core-count processors, which are no longer used solely for graphics applications, but are also employed to accelerate computationally intensive general-purpose tasks. For utmost performance, GPUs are distributed throughout the cluster to process parallel programs. In fact, many recent high-performance systems in the TOP500 list are heterogeneous architectures. Despite being highly effective processing units, GPUs on different hosts are incapable of communicating without assistance from a CPU. As a result, communication between distributed GPUs suffers from unnecessary overhead, introduced by switching control flow from GPUs to CPUs and vice versa. Most communication libraries even require intermediate copies from GPU memory to host memory. This overhead in particular penalizes small data movements and synchronization operations, reduces efficiency and limits scalability. In this work we introduce global address spaces to facilitate direct communication between distributed GPUs without CPU involvement. Avoiding context switches and unnecessary copying dramatically reduces communication overhead. We evaluate our approach using a variety of workloads including low-level latency and bandwidth benchmarks, basic synchronization primitives like barriers, and a stencil computation as an example application. We see performance benefits of up to 2× for basic benchmarks and up to 1.67× for stencil computations.