Asia-Pacific Conference on Computer Aided System Engineering

Automatic Parallelization of GPU Applications Using OpenCL



Abstract

Graphics Processing Units (GPUs) have been successfully used to accelerate scientific applications thanks to their computational power and the availability of programming languages that make writing scientific applications for GPUs more approachable. However, since the GPU programming model requires offloading all data to GPU memory, an application's memory footprint is limited by the size of the GPU memory. Multi-GPU systems can make such memory-limited problems tractable by distributing the computation and data among the available GPUs. Applications written to run on single-GPU systems can be parallelized (i) at runtime, through an environment that captures memory operations and kernel calls and distributes them among the available GPUs, or (ii) at compile time, through a pre-compiler that transforms the application so that its data and computation are decomposed among the available GPUs. In this paper we propose a framework and implement a tool that transforms an OpenCL application written for single-GPU systems into one that runs on multi-GPU systems. Based on data-dependency and data-usage analysis, the application is transformed to decompose its data and computation among the available GPUs. To reduce data transfer overhead, computation-communication overlapping techniques are employed. We tested our tool on two applications with different data transfer requirements: for the application with no data transfer requirements, a linear speedup is achieved, while for the application with data transfers, computation-communication overlapping reduces the communication overhead by 40%.
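
The paper's tool performs this transformation automatically; as a rough illustration only, the following is a minimal hand-written C/OpenCL sketch of the kind of decomposition described above: the global index space of a kernel and its buffers are split evenly across all available GPUs, and the host-to-device writes are enqueued non-blocking (CL_FALSE) so that a transfer to one device can overlap with computation already running on the others. The vecadd kernel, the host array names, and the even-split assumption are illustrative choices, not taken from the paper.

/* Hypothetical sketch: decompose a 1D OpenCL kernel across all GPUs of one
 * platform.  Identifiers (the vecadd kernel, arrays a/b/c) are illustrative. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

int main(void) {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_uint ndev = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &ndev);
    cl_device_id *devs = malloc(ndev * sizeof(cl_device_id));
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, ndev, devs, NULL);

    cl_context ctx = clCreateContext(NULL, ndev, devs, NULL, NULL, NULL);

    const char *src =
        "__kernel void vecadd(__global const float *a, __global const float *b,"
        "                     __global float *c) {"
        "    size_t i = get_global_id(0);"
        "    c[i] = a[i] + b[i];"
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, ndev, devs, NULL, NULL, NULL);

    float *a = malloc(N * sizeof(float));
    float *b = malloc(N * sizeof(float));
    float *c = malloc(N * sizeof(float));
    for (size_t i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_command_queue *q = malloc(ndev * sizeof(cl_command_queue));
    size_t chunk = N / ndev;                 /* assume N divides evenly */

    for (cl_uint d = 0; d < ndev; d++) {
        size_t off = d * chunk;
        size_t bytes = chunk * sizeof(float);
        q[d] = clCreateCommandQueue(ctx, devs[d], 0, NULL);

        /* Each device allocates and receives only its slice of the data. */
        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

        /* Non-blocking writes (CL_FALSE): the transfer to device d can overlap
         * with kernels already enqueued on the other devices. */
        clEnqueueWriteBuffer(q[d], da, CL_FALSE, 0, bytes, a + off, 0, NULL, NULL);
        clEnqueueWriteBuffer(q[d], db, CL_FALSE, 0, bytes, b + off, 0, NULL, NULL);

        cl_kernel k = clCreateKernel(prog, "vecadd", NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &da);
        clSetKernelArg(k, 1, sizeof(cl_mem), &db);
        clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

        /* The device's global work size is just its chunk of the index space. */
        clEnqueueNDRangeKernel(q[d], k, 1, NULL, &chunk, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q[d], dc, CL_FALSE, 0, bytes, c + off, 0, NULL, NULL);
    }

    for (cl_uint d = 0; d < ndev; d++) clFinish(q[d]);   /* wait for all devices */
    printf("c[0]=%f c[N-1]=%f\n", c[0], c[N - 1]);
    return 0;
}

Because each device holds only its own slice of the buffers, the combined footprint can exceed a single GPU's memory, which is the memory-limitation benefit the abstract refers to; an actual tool would additionally have to handle partitions that do not divide evenly and kernels whose memory accesses cross partition boundaries.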
