Source: International Conference on Supercomputing

Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems

Abstract

We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi's iterative method for the 2-D Poisson equation on a structured grid, in both single and double precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on an NVIDIA C1060, and 78% on a C870. Motivated to find a still faster implementation, we further consider 'wildly asynchronous' implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations. By doing so, we trade off more flops, via more iterations to converge, for a higher degree of asynchronous parallelism. Our wild implementations on a GPU can be 1.2-2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly 'fast-and-loose' algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs.
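The trade-off described in the abstract can be illustrated with a small serial sketch. This is not the authors' GPU code. It assumes a unit square with zero Dirichlet boundary values, a right-hand side chosen so that the exact solution is sin(πx)sin(πy), and a hypothetical "delayed synchronization" scheme in which the grid is split into row strips and each strip runs several Jacobi sweeps against a stale copy of its neighbours before values are exchanged. All function names and parameters (`make_problem`, `sweep_rows`, `delayed_round`, `local_sweeps`, `strips`) are invented for illustration.

```python
import math

def make_problem(n):
    """n x n grid on the unit square; f is chosen so that u = sin(pi x) sin(pi y)."""
    h = 1.0 / (n - 1)
    f = [[2.0 * math.pi ** 2 * math.sin(math.pi * i * h) * math.sin(math.pi * j * h)
          for j in range(n)] for i in range(n)]
    u = [[0.0] * n for _ in range(n)]  # zero initial guess, zero boundary
    return u, f, h

def sweep_rows(u, f, h, lo, hi):
    """Jacobi-update rows lo..hi-1 from the values in u; all other rows are copied."""
    n = len(u)
    new = [row[:] for row in u]
    for i in range(lo, hi):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1]
                                + u[i][j + 1] + h * h * f[i][j])
    return new

def residual(u, f, h):
    """Max-norm of the scaled residual h^2 f + (neighbour sum) - 4 u on the interior."""
    n = len(u)
    return max(abs(h * h * f[i][j] + u[i - 1][j] + u[i + 1][j] + u[i][j - 1]
                   + u[i][j + 1] - 4.0 * u[i][j])
               for i in range(1, n - 1) for j in range(1, n - 1))

def sync_round(u, f, h):
    """One fully synchronized Jacobi sweep over the whole interior."""
    return sweep_rows(u, f, h, 1, len(u) - 1)

def delayed_round(u, f, h, local_sweeps, strips):
    """Each strip runs local_sweeps (>= 1) sweeps while rows owned by other
    strips stay frozen at their values from the start of the round."""
    interior = len(u) - 2
    result = [row[:] for row in u]
    for s in range(strips):
        lo = 1 + s * interior // strips
        hi = 1 + (s + 1) * interior // strips
        local = u  # stale snapshot; sweep_rows never mutates its input
        for _ in range(local_sweeps):
            local = sweep_rows(local, f, h, lo, hi)
        for i in range(lo, hi):
            result[i] = local[i]  # "exchange": publish this strip's rows
    return result

def solve(n, tol, local_sweeps=1, strips=1, max_rounds=10000):
    """Iterate until the scaled residual drops below tol; returns (u, rounds)."""
    u, f, h = make_problem(n)
    rounds = 0
    while residual(u, f, h) > tol and rounds < max_rounds:
        u = delayed_round(u, f, h, local_sweeps, strips)
        rounds += 1
    return u, rounds

if __name__ == "__main__":
    for k in (1, 4):
        _, rounds = solve(10, 1e-8, local_sweeps=k, strips=2)
        print(f"local_sweeps={k}: {rounds} exchange rounds, {rounds * k} sweeps per strip")
```

With `local_sweeps=1` the delayed scheme reduces to ordinary synchronized Jacobi. Larger values let each strip do more arithmetic between exchanges, which is the flops-for-synchronization exchange the abstract describes. Because this sketch runs serially, it shows only the numerical behaviour and says nothing about the speedups reported for the GPU implementations.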
