Venue: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Scalable Framework for Mapping Streaming Applications onto Multi-GPU Systems

Abstract

Graphics processing units (GPUs) use a large array of parallel processing cores to accelerate a specific streaming computation pattern frequently found in graphics applications. Many general-purpose applications exhibit the required streaming behavior, but they also have unfavorable data layouts and poor computation-to-communication ratios that penalize any straightforward execution on the GPU. In this paper we describe an efficient and scalable code generation framework that maps general-purpose streaming applications onto a multi-GPU system. The framework spans the entire core and memory hierarchy exposed by the multi-GPU system. Several key features ensure the scalability required by complex streaming applications. First, we propose an efficient stream graph partitioning algorithm that partitions a complex application to achieve the best performance under a given shared memory constraint. Next, the resulting partitions are mapped to multiple GPUs using an efficient architecture-driven strategy that balances the workload while accounting for communication overhead. Finally, the partitions are executed on the multi-GPU system in a highly effective pipelined fashion. The framework has been implemented as a back end of the StreamIt programming language compiler. Our comprehensive experiments demonstrate its scalability and a significant speedup over a previous state-of-the-art solution.
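
The abstract names a mapping step that balances workload against communication overhead but does not spell out the algorithm. The following is only an illustrative sketch of that general trade-off, not the authors' method: a generic greedy heuristic that places the heaviest partitions first and charges a penalty for traffic crossing GPU boundaries. The function name `map_partitions`, the inputs `work` and `edges`, and the weight `comm_cost` are all hypothetical and introduced here for illustration.

```python
from typing import Dict, Tuple


def map_partitions(
    work: Dict[str, float],
    edges: Dict[Tuple[str, str], float],
    num_gpus: int,
    comm_cost: float = 1.0,
) -> Dict[str, int]:
    """Greedy sketch (an assumption, not the paper's algorithm).

    work:      estimated compute cost of each partition
    edges:     data volume exchanged between pairs of partitions
    comm_cost: hypothetical weight turning cross-GPU volume into cost
    Returns a partition name -> GPU index assignment.
    """
    load = [0.0] * num_gpus
    placement: Dict[str, int] = {}

    # Place heaviest partitions first; break ties by name for determinism.
    for part in sorted(work, key=lambda name: (-work[name], name)):
        best_gpu, best_score = 0, float("inf")
        for gpu in range(num_gpus):
            # Sum traffic to already-placed neighbours on a different GPU.
            cross = 0.0
            for (a, b), volume in edges.items():
                if a == part:
                    other = b
                elif b == part:
                    other = a
                else:
                    continue
                if other in placement and placement[other] != gpu:
                    cross += volume
            # Score = resulting load on this GPU plus communication penalty.
            score = load[gpu] + work[part] + comm_cost * cross
            if score < best_score:
                best_gpu, best_score = gpu, score
        placement[part] = best_gpu
        load[best_gpu] += work[part]
    return placement
```

With `work = {"A": 5, "B": 3, "C": 2}`, a single edge of volume 10 between `A` and `C`, and two GPUs, this sketch keeps `C` on the same GPU as `A`; setting `comm_cost=0` ignores communication and moves `C` to the more lightly loaded GPU instead.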
