Journal: Concurrency, practice and experience

Exploiting GPU memory hierarchy for accelerating a specialized stencil computation


Abstract

Stencil computations are an important class of problems that can benefit from graphics processing units (GPUs). However, given the hierarchical and on-chip blocked memory organization in GPUs, memory performance degrades for specific data access patterns in stencils. Hence, an appropriate data layout is needed to use the different levels of memory effectively and harvest the full potential of GPUs. In this context, a specialized stencil computation problem, namely the lattice Boltzmann method, which has a complex neighborhood relationship along with loop-carried dependence, is considered as a strong case study. Four different approaches for the lattice Boltzmann method have been developed in this work by exploiting the memory hierarchy with new data layouts and kernel organizations. These methods have been developed with the primary aim of increasing the compute-to-global-memory-access ratio and reducing the overall read-write latency, even at the expense of additional computations. The NVIDIA GPUs TitanX, GTX 960, GTX 740Ti, and GTX 650Ti have been used to test the proposed techniques. The compute-to-global-memory-access ratio shows an improvement of 2 to 10 times over the naive solutions in this work. The performance, in terms of time taken per iteration, is improved by up to 3.7 times. The million lattice units per second for both 2DQ9 and 3DQ19 models improve by more than 2 times.
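The abstract does not spell out the four layouts, so the following is only a generic sketch of why data layout matters for a nine-velocity 2D lattice Boltzmann grid. It is not the authors' method. The function names `aos_index`, `soa_index`, and `mlups` are hypothetical, and the throughput formula (lattice sites times iterations, divided by elapsed time, in millions) is an assumed reading of the "million lattice units per second" metric.

```python
Q = 9  # number of discrete velocities per cell in a nine-velocity 2D model

def aos_index(x, y, q, nx):
    """Array-of-structures: the Q distributions of one cell are adjacent."""
    return (y * nx + x) * Q + q

def soa_index(x, y, q, nx, ny):
    """Structure-of-arrays: one contiguous nx*ny plane per direction q."""
    return q * nx * ny + y * nx + x

def mlups(nx, ny, iterations, seconds):
    """Assumed metric: million lattice-site updates per second."""
    return nx * ny * iterations / (seconds * 1e6)

nx, ny = 8, 4  # small illustrative grid

# Distance in memory between cells x and x+1 for the same direction q.
# On a GPU, neighbouring threads usually handle neighbouring cells, so a
# stride of 1 lets their global-memory reads be served together.
aos_stride = aos_index(1, 0, 3, nx) - aos_index(0, 0, 3, nx)
soa_stride = soa_index(1, 0, 3, nx, ny) - soa_index(0, 0, 3, nx, ny)

print("AoS stride:", aos_stride)
print("SoA stride:", soa_stride)
print("MLUPS for 1024x1024, 100 iterations in 1 s:", mlups(1024, 1024, 100, 1.0))
```

In this sketch the array-of-structures layout places the same direction of neighbouring cells `Q` elements apart, while the structure-of-arrays layout places them one element apart, which is the kind of access-pattern difference the abstract alludes to.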
