The 39th International Conference on Parallel Processing

Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs


Abstract

To exploit the full potential of GPGPUs for general-purpose computing, the DOACROSS parallelism abundant in scientific and engineering applications must be harnessed. However, the cross-iteration data dependences in DOACROSS loops pose an obstacle to executing their computations concurrently with a massive number of fine-grained threads. This work focuses on iterative PDE solvers rich in DOACROSS parallelism to identify optimization principles and strategies that allow them to be mapped efficiently to GPGPUs. Our main finding is that certain DOACROSS loops can be accelerated further on GPGPUs if they are algorithmically restructured (by a domain expert) to be more amenable to GPGPU parallelization, judiciously optimized (by the compiler), and carefully tuned (by a performance-tuning tool). We substantiate this finding with a case study, presenting a new parallel SSOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained non-conventionally: we start from a K-layer SSOR method and parallelize it by applying a non-dependence-preserving scheme that consists of a new domain decomposition technique followed by a generalized loop tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by striking a better balance between data reuse and parallelism and by trading convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy between domain experts, compiler optimizations, and performance tuning in maximizing the performance of applications, particularly PDE-based DOACROSS loops, on GPGPUs.
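
The terms in the abstract can be made concrete with a small sketch. It is not the paper's SSOR method, domain decomposition, or tiling scheme; it only illustrates, on an assumed 5-point Laplace stencil, what a DOACROSS loop is and how a dependence-preserving reordering (a wavefront sweep) differs from a non-dependence-preserving one (the red-black SOR baseline named above). All function names, the grid size, the boundary values, and the relaxation factor are hypothetical choices made for illustration, and the code is serial Python rather than GPU code.

```python
# Illustrative sketch only: SOR on a 5-point Laplace stencil (assumed problem).
# Every name and parameter below is hypothetical, not taken from the paper.

def make_grid(n, top=1.0):
    """n x n grid: zero everywhere except a fixed boundary value on the top row."""
    u = [[0.0] * n for _ in range(n)]
    u[0] = [top] * n
    return u

def relax(u, i, j, omega):
    """One in-place SOR update of interior point (i, j)."""
    gs = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    u[i][j] += omega * (gs - u[i][j])

def sweep_natural(u, omega):
    """Row-major sweep. Point (i, j) reads the already-updated (i-1, j) and
    (i, j-1), so the loop nest carries cross-iteration dependences (DOACROSS)."""
    n = len(u)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            relax(u, i, j, omega)

def sweep_wavefront(u, omega):
    """Same dependences, reordered by anti-diagonal d = i + j. Points on one
    diagonal never read each other, so each inner loop could run in parallel."""
    n = len(u)
    for d in range(2, 2 * (n - 2) + 1):
        for i in range(max(1, d - (n - 2)), min(n - 2, d - 1) + 1):
            relax(u, i, d - i, omega)

def sweep_red_black(u, omega):
    """Red-black ordering: all points of one colour are mutually independent.
    The update order changes, so the original dependences are not preserved."""
    n = len(u)
    for colour in (0, 1):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == colour:
                    relax(u, i, j, omega)

def solve(sweep, n=10, omega=1.5, iters=300):
    """Apply `sweep` a fixed number of times to a fresh grid and return it."""
    u = make_grid(n)
    for _ in range(iters):
        sweep(u, omega)
    return u

def max_diff(a, b):
    """Largest absolute pointwise difference between two grids."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

if __name__ == "__main__":
    nat, wave, rb = (solve(s) for s in (sweep_natural, sweep_wavefront, sweep_red_black))
    print("natural vs wavefront:", max_diff(nat, wave))
    print("natural vs red-black:", max_diff(nat, rb))
```

In this sketch the wavefront sweep performs the same arithmetic on the same inputs as the row-major sweep, so the two produce the same iterates, while the red-black sweep produces different intermediate iterates and only agrees with them in the converged solution. That difference is the general sense in which a reordering is or is not dependence-preserving; how the paper's own scheme gives up dependences in exchange for SIMD parallelism is described in the paper itself, not here.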
