2012 IEEE 18th International Conference on Parallel and Distributed Systems

Optimizing Dynamic Programming on Graphics Processing Units Via Data Reuse and Data Prefetch with Inter-Block Barrier Synchronization



Abstract

Our previous study focused on accelerating an important category of DP problems, called nonserial polyadic dynamic programming (NPDP), on a graphics processing unit (GPU). In NPDP applications, the degree of parallelism varies significantly across the stages of computation, making it difficult to fully utilize the compute power of the hundreds of processing cores in a GPU. To address this challenge, we proposed a methodology that adaptively adjusts the thread-level parallelism when mapping an NPDP problem onto the GPU, thus providing a sufficient and steady degree of parallelism across different compute stages. This work aims at further improving the performance of NPDP problems. Subproblems and data are tiled so that small data regions fit into shared memory and the buffered data can be reused for each tile of subproblems, thereby reducing the amount of global memory access. However, we found that invoking the same kernel many times, as required to enforce data consistency across different stages, makes it impossible to reuse the tiled data in shared memory once the kernel is invoked again. Fortunately, the inter-block synchronization technique allows us to invoke the kernel exactly once, with the restriction that the maximum number of blocks equals the total number of streaming multiprocessors. In addition to enabling data reuse, invoking the kernel only once also lets us prefetch data into shared memory across inter-block synchronization points, which improves performance even more than data reuse does. We realize our approach in a real-world NPDP application: the optimal matrix parenthesization problem. Experimental results demonstrate that invoking a kernel only once cannot guarantee a performance improvement unless we also reuse and prefetch data across barrier synchronization points.
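To make the NPDP structure of the case study concrete, below is a minimal CPU-side sketch of the optimal matrix parenthesization (matrix-chain) recurrence the abstract refers to. This is an illustration only, not the paper's GPU implementation: the tiling, shared-memory buffering, and inter-block barrier are omitted, and the function name is our own.

```python
# Minimal CPU sketch of the optimal matrix parenthesization DP, the
# NPDP case study named in the abstract. dims[i] x dims[i+1] is the
# shape of matrix i; cost[i][j] is the minimum number of scalar
# multiplications needed to compute the product of matrices i..j.
# The nonserial polyadic dependence shows up in the inner min():
# cell (i, j) reads two earlier cells (i, k) and (k+1, j), and the
# number of independent cells per stage shrinks as the chain length
# grows -- the varying parallelism the paper's mapping must handle.

def matrix_chain_cost(dims):
    n = len(dims) - 1                      # number of matrices
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):         # one "stage" per chain length
        for i in range(n - length + 1):    # cells in a stage are independent
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

On a GPU, the cells of one anti-diagonal stage would be computed in parallel, with a barrier between stages; the paper's contribution is replacing per-stage kernel launches with a single launch plus an inter-block barrier, so tiled data can stay resident in shared memory across stages.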
