Optimizing GPGPU Kernel Summation for Performance and Energy Efficiency

Abstract

Kernel summation is a widely used computational kernel built on the matrix-matrix multiplication (GEMM) and matrix-vector multiplication (GEMV) primitives. The parallelism it exhibits suggests that it should run faster on a GPGPU. State-of-the-art GPU solutions use the cuBLAS library but cannot exploit much of the available data locality, because intermediate results are written back to main memory between key operations. This paper presents an optimized implementation that yields better performance and higher energy efficiency. Our contributions are fusing all steps of kernel summation into the matrix multiplication code structure and optimizing memory access ordering to make good use of shared memory and the cache hierarchy. We decompose the kernel summation problem into individual tasks with few dependencies and strike a balance between finer-grained parallelism and reduced data replication. Based on hardware characteristics, we map threads to matrix elements in an interleaved way and reposition matrix elements to avoid shared memory load and store bank conflicts. We also apply double buffering to hide memory access latency. We analyze both the performance and the energy benefits of our fused kernel summation compared with the implementation based on cuBLAS. In low dimensions, our approach achieves a speedup of up to 1.8x, and it saves up to 33% of total energy across all tested problem sizes.
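The fusion idea described in the abstract can be illustrated with a small CPU-side sketch. It is not the authors' CUDA implementation. It assumes a Gaussian kernel with bandwidth `h` (the abstract does not name the kernel function), and the names `kernel_sum_unfused`, `kernel_sum_fused`, and `tile` are hypothetical. The unfused version stores the full inner-product and kernel matrices between steps, as a library-call pipeline would; the fused version evaluates the kernel tile by tile and accumulates directly into the output, so no intermediate matrix is kept.

```python
import math


def sq_norms(points):
    """Squared Euclidean norm of each point."""
    return [sum(c * c for c in p) for p in points]


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def kernel_sum_unfused(X, Y, w, h):
    """Baseline: materialize every intermediate matrix between steps.

    Assumes a Gaussian kernel exp(-||x - y||^2 / (2 h^2)); this choice is
    an illustration, not something stated in the abstract.
    """
    xn, yn = sq_norms(X), sq_norms(Y)
    # GEMM-like step: all pairwise inner products, stored in full.
    G = [[dot(x, y) for y in Y] for x in X]
    # Kernel evaluation reads G back and stores a second full matrix K.
    K = [[math.exp(-(xn[i] + yn[j] - 2.0 * G[i][j]) / (2.0 * h * h))
          for j in range(len(Y))] for i in range(len(X))]
    # GEMV-like step: u = K w.
    return [dot(row, w) for row in K]


def kernel_sum_fused(X, Y, w, h, tile=2):
    """Fused: same arithmetic, but no M x N intermediate is ever stored."""
    xn, yn = sq_norms(X), sq_norms(Y)
    u = [0.0] * len(X)
    # Walk over Y in tiles (a stand-in for a block held in fast memory).
    for j0 in range(0, len(Y), tile):
        j1 = min(j0 + tile, len(Y))
        for i, x in enumerate(X):
            acc = 0.0
            for j in range(j0, j1):
                d2 = xn[i] + yn[j] - 2.0 * dot(x, Y[j])
                acc += math.exp(-d2 / (2.0 * h * h)) * w[j]
            u[i] += acc  # accumulate the tile's contribution in place
    return u
```

Both functions return the same vector up to floating-point rounding; the difference lies only in how much intermediate data is written out and read back, which is the locality cost the abstract attributes to the cuBLAS-based approach.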

Bibliographic record

Source
International Conference on Parallel Processing Workshops, 2016, pp. 123-132 (10 pages)
Authors
Jiajun Wang; Ahmed Khawaja; George Biros; Andreas Gerstlauer; Lizy K. John;
Original format: PDF
Keywords
Kernel; Libraries; Instruction sets; Graphics processing units; Random access memory; Memory management;

