Published in: Software, practice & experience

PFACC: An OpenACC-like programming model for irregular nested parallelism

Abstract

OpenACC is a directive-based programming model that lets programmers write graphics processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are a natural way to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of the memory hierarchy. Parallel loops can be arbitrarily nested, or placed inside functions that are called (possibly recursively) from other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Within a thread block, loop iterations are dynamically organized into batches, and the batches are executed in depth-first order. Across thread blocks, iterations are shared through an iteration-stealing mechanism. Because of the depth-first execution order, PFACC generates CUDA programs with reasonable memory usage. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in both performance and code size on most benchmarks.
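To make "irregular nested parallel loops" concrete, the sketch below shows one common shape of the problem: a sparse matrix-vector product in CSR format, where the inner loop's trip count differs for every outer iteration. It is annotated with standard OpenACC directives, not PFACC directives, since the abstract does not show PFACC's syntax. The function name `spmv_csr`, its parameters, and the choice of CSR as the example are assumptions made here for illustration and are not taken from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: y = A * x for a CSR matrix A.
 * Row i owns entries row_start[i] .. row_start[i + 1] - 1, so the inner
 * trip count varies from row to row (an irregular nested loop).
 * A compiler without OpenACC support ignores the pragmas and runs this
 * serially with the same result. */
void spmv_csr(int n_rows, int n_cols, const int *row_start, const int *col,
              const double *val, const double *x, double *y)
{
    assert(n_rows >= 0 && n_cols >= 0 && row_start != NULL && y != NULL);
    int nnz = row_start[n_rows];  /* total number of stored entries */

    /* Outer loop: one parallel iteration per row. */
    #pragma acc parallel loop \
        copyin(row_start[0:n_rows + 1], col[0:nnz], val[0:nnz], x[0:n_cols]) \
        copyout(y[0:n_rows])
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        /* Inner loop: its length depends on i, so a static mapping of
         * iterations to threads can leave many threads idle. */
        #pragma acc loop reduction(+:sum)
        for (int k = row_start[i]; k < row_start[i + 1]; ++k)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}
```

With rows of length 1, 3, and 0, for instance, the work available to the inner loop is badly unbalanced across outer iterations. This kind of imbalance is the situation the abstract's runtime iteration sharing is meant to address.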