Conference: International Conference on Parallel Processing Workshops

Optimizing GPGPU Kernel Summation for Performance and Energy Efficiency

Abstract

Kernel summation is a widely used computational kernel built on two primitives: matrix-matrix multiplication (GEMM) and matrix-vector multiplication (GEMV). The parallelism exhibited in kernel summation suggests that it can run faster on GPGPUs. State-of-the-art GPU solutions use the cuBLAS library but cannot exploit much of the data locality, because intermediate results are written back to main memory between key operations. This paper presents an optimized implementation that yields better performance and high energy efficiency. Our contributions are fusing all steps of kernel summation into the matrix multiplication code structure and optimizing the memory access order to make good use of shared memory and the cache hierarchy. We decompose the kernel summation problem into individual tasks with few dependencies, striking a balance between finer-grained parallelism and reduced data replication. Based on hardware characteristics, we map threads to matrix elements in an interleaved way and reposition matrix elements to avoid shared memory bank conflicts on loads and stores. We also apply double buffering to hide memory access latency. We analyze both the performance and the energy benefits of our fused kernel summation compared with the cuBLAS-based implementation. We show that in low dimensions our approach achieves a speedup of up to 1.8X, and that it saves up to 33% of total energy across all tested problem sizes.
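To make the fusion idea concrete, the following is a minimal CPU sketch in plain Python, not the paper's GPU code. It assumes a Gaussian kernel with bandwidth `h`, which the abstract does not specify, and the function names `kernel_sum_unfused` and `kernel_sum_fused` are hypothetical. The unfused version mirrors the library-call structure (a GEMM-like pass, a kernel evaluation pass, then a GEMV pass), each of which stores a full intermediate matrix. The fused version consumes each kernel value as soon as it is computed, so no intermediate matrix is written out.

```python
import math

def kernel_sum_unfused(X, Y, w, h=1.0):
    """Three separate passes; each stores a full len(X) x len(Y) intermediate."""
    # Pass 1 (GEMM-like): all pairwise inner products, i.e. X times Y transposed.
    G = [[sum(a * b for a, b in zip(x, y)) for y in Y] for x in X]
    # Pass 2: Gaussian kernel (an assumption), using
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * (x . y).
    xn = [sum(a * a for a in x) for x in X]
    yn = [sum(b * b for b in y) for y in Y]
    K = [[math.exp(-(xn[i] + yn[j] - 2.0 * G[i][j]) / (2.0 * h * h))
          for j in range(len(Y))] for i in range(len(X))]
    # Pass 3 (GEMV): weighted sum over each row of the kernel matrix.
    return [sum(k * wj for k, wj in zip(row, w)) for row in K]

def kernel_sum_fused(X, Y, w, h=1.0):
    """Single pass; inner product, kernel value, and weighted sum are fused."""
    yn = [sum(b * b for b in y) for y in Y]
    out = []
    for x in X:
        xn = sum(a * a for a in x)
        acc = 0.0
        for y, y2, wj in zip(Y, yn, w):
            dot = sum(a * b for a, b in zip(x, y))
            # The kernel value is used immediately and never stored.
            acc += math.exp(-(xn + y2 - 2.0 * dot) / (2.0 * h * h)) * wj
        out.append(acc)
    return out
```

Both functions return the same vector up to floating-point rounding. The sketch only illustrates why fusion removes the intermediate writes; the GPU-specific techniques named in the abstract (interleaved thread mapping, bank-conflict avoidance, double buffering) are not modeled here.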