A Scalable Framework for Heterogeneous GPU-Based Clusters


Abstract

GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much-improved single-node computational performance. However, there is little parallel software available that can efficiently utilize all the CPU cores and all the GPUs on such heterogeneous systems. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so that communication eventually becomes the bottleneck of the entire system. To overcome this bottleneck, we developed a multilevel partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic-scheduling runtime system to schedule tasks and to transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to resolve data dependencies between tasks without any coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task-generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [25] using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework also attains high performance on distributed-memory clusters without GPUs, and on shared-system multi-GPUs.