Journal: Computing

A new memory mapping mechanism for GPGPUs' stencil computation

Abstract

When optimizing performance on a GPU, control flow divergence among the threads of a warp can become a performance bottleneck. In our hand-coded GPU stencil computation, the conventional mapping between global memory and shared memory introduces such divergence. To remove it, we devise a new mapping mechanism that models the coalesced memory accesses of GPU threads and the overhead of aligned ghost zones, eliminating the conditional statements for stencil points on the XY-tile boundary and improving performance. In addition, to reduce overheads, we use only one XY-tile loaded into registers in each stencil iteration and apply common subexpression elimination and software prefetching. Finally, a detailed performance evaluation shows that, with these optimizations, global memory access traffic is close to the idealized lower bound of 8 bytes per XY-tile point, a bound that ignores ghost zone overheads: the traffic per computed point is roughly 6% and 4% above that bound on Tesla C2050 and Kepler K20X, respectively.
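
The traffic figures in the abstract can be made concrete with a little arithmetic. The sketch below is not the paper's model. The first function only converts the stated overheads (6% and 4%) into bytes per point. The second is a hypothetical back-of-the-envelope model of how an aligned ghost zone raises traffic above 8 bytes per point. It assumes 4-byte elements, one read and one write per point, a tile size of 256 x 64, a ghost zone one point wide, and rows padded to a multiple of 32 elements. None of these values comes from the source.

```python
IDEAL_BYTES_PER_POINT = 8.0  # idealized lower bound stated in the abstract


def reported_traffic(overhead_fraction):
    """Bytes per point implied by a fractional overhead above the ideal bound."""
    return IDEAL_BYTES_PER_POINT * (1.0 + overhead_fraction)


def modeled_traffic(tile_x, tile_y, halo, align, elem_bytes=4):
    """Hypothetical model (not the paper's): bytes per computed point when a
    tile plus its ghost zone is read once with rows padded to `align`
    elements, and each computed point is written once."""
    row_elems = tile_x + 2 * halo
    padded_x = -(-row_elems // align) * align  # round row width up to alignment
    padded_y = tile_y + 2 * halo
    read_bytes = padded_x * padded_y * elem_bytes
    write_bytes = tile_x * tile_y * elem_bytes
    return (read_bytes + write_bytes) / (tile_x * tile_y)


if __name__ == "__main__":
    # Per-point traffic implied by the abstract's 6% and 4% figures.
    print("Tesla C2050:", reported_traffic(0.06), "bytes/point")
    print("Kepler K20X:", reported_traffic(0.04), "bytes/point")
    # Assumed tile geometry, for illustration only.
    print("no ghost zone:", modeled_traffic(256, 64, 0, 32), "bytes/point")
    print("aligned ghost zone:", modeled_traffic(256, 64, 1, 32), "bytes/point")
```

With no ghost zone the model reduces to the 8-byte bound. Any nonzero halo or alignment padding pushes the figure above it, which is the overhead the abstract reports as a few percent.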
