2012 IEEE 18th International Conference on Parallel and Distributed Systems

Optimizing Dynamic Programming on Graphics Processing Units Via Data Reuse and Data Prefetch with Inter-Block Barrier Synchronization



Abstract

Our previous study focused on accelerating an important category of DP problems, called nonserial polyadic dynamic programming (NPDP), on a graphics processing unit (GPU). In NPDP applications, the degree of parallelism varies significantly across the stages of computation, making it difficult to fully utilize the compute power of the hundreds of processing cores in a GPU. To address this challenge, we proposed a methodology that adaptively adjusts the thread-level parallelism when mapping an NPDP problem onto the GPU, thus providing a sufficient and steady degree of parallelism across different compute stages. This work aims at further improving the performance of NPDP problems. Subproblems and data are tiled so that small data regions fit into shared memory and the buffered data can be reused for each tile of subproblems, thereby reducing the amount of global memory access. However, we found that invoking the same kernel many times, as required to enforce data consistency across different stages, makes it impossible to reuse the tiled data in shared memory once the kernel is invoked again. Fortunately, the inter-block synchronization technique allows us to invoke the kernel exactly once, with the restriction that the maximum number of blocks equals the total number of streaming multiprocessors. In addition to enabling data reuse, invoking the kernel only once also lets us prefetch data into shared memory across inter-block synchronization points, which improves performance even more than data reuse does. We realize our approach in a real-world NPDP application: the optimal matrix parenthesization problem. Experimental results demonstrate that invoking a kernel only once cannot guarantee a performance improvement unless we also reuse and prefetch data across barrier synchronization points.
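To make the NPDP structure of the case study concrete, below is a minimal CPU-side sketch of the optimal matrix parenthesization (matrix-chain) recurrence the abstract refers to. This is an illustration only, not the paper's GPU implementation: the tiling, shared-memory buffering, and inter-block barrier are omitted, and the function name is our own.

```python
# Minimal CPU sketch of the optimal matrix parenthesization DP, the
# NPDP case study named in the abstract. dims[i] x dims[i+1] is the
# shape of matrix i; cost[i][j] is the minimum number of scalar
# multiplications needed to compute the product of matrices i..j.
# The nonserial polyadic dependence shows up in the inner min():
# cell (i, j) reads two earlier cells (i, k) and (k+1, j), and the
# number of independent cells per stage shrinks as the chain length
# grows -- the varying parallelism the paper's mapping must handle.

def matrix_chain_cost(dims):
    n = len(dims) - 1                      # number of matrices
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):         # one "stage" per chain length
        for i in range(n - length + 1):    # cells in a stage are independent
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

On a GPU, the cells of one anti-diagonal stage would be computed in parallel, with a barrier between stages; the paper's contribution is replacing per-stage kernel launches with a single launch plus an inter-block barrier, so tiled data can stay resident in shared memory across stages.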
