Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs

Abstract

We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPU). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single-precision and 15.7 Gflop/s in double-precision on the NVIDIA T10P multiprocessor (C1060), enhancing prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8× and 1.5× for single- and double-precision respectively. However, achieving this level of performance requires input matrix-dependent parameter tuning. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and run-time estimation, but more directly models the structure of multithreaded vector processors like GPUs. We show that our model can identify the implementations that achieve within 15% of those found through exhaustive search.
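
For readers unfamiliar with the storage formats named in the abstract, the following is a minimal sketch of SpMV over the plain (unblocked) ELLPACK layout, written in pure Python for clarity. It is an illustration only, not the authors' GPU implementation: the function names `to_ellpack` and `spmv_ellpack` and the choice of padding column are assumptions made for this example. As the name suggests, the blocked variant (BELLPACK) would store small dense blocks in place of the scalar entries used here.

```python
def to_ellpack(dense, pad_col=0):
    """Convert a dense list-of-lists matrix to ELLPACK arrays.

    Every row is padded to the same width (the largest nonzero count of
    any row). Padded slots hold the value 0.0 and the column index
    pad_col (an illustrative choice), so they contribute nothing.
    """
    width = max(sum(1 for v in row if v != 0) for row in dense)
    values, columns = [], []
    for row in dense:
        nonzeros = [(j, v) for j, v in enumerate(row) if v != 0]
        padding = width - len(nonzeros)
        columns.append([j for j, _ in nonzeros] + [pad_col] * padding)
        values.append([v for _, v in nonzeros] + [0.0] * padding)
    return values, columns


def spmv_ellpack(values, columns, x):
    """Compute y = A * x, where A is given in ELLPACK form.

    Each row performs the same number of multiply-adds, which is the
    regularity that makes this layout attractive for vector hardware.
    """
    return [
        sum(v * x[j] for v, j in zip(row_vals, row_cols))
        for row_vals, row_cols in zip(values, columns)
    ]


A = [
    [4.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
    [3.0, 0.0, 5.0],
]
x = [1.0, 2.0, 3.0]

values, columns = to_ellpack(A)
y = spmv_ellpack(values, columns, x)
print(y)  # [7.0, 4.0, 18.0]
```

The padding in the second row shows the trade-off that tuning has to weigh: a regular layout costs extra stored zeros, and how much it costs depends on the input matrix.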

Bibliographic details

Source
Principles and Practice of Parallel Programming | 2010 | pp. 115-125 | 11 pages
Authors
Jee W. Choi; Amik Singh; Richard W. Vuduc
Original format: PDF
Chinese Library Classification: Computer software
Keywords
GPU; sparse matrix-vector multiplication; performance modeling

Similar literature

1. Model-driven autotuning of sparse matrix-vector multiply on GPUs [J]. Choi Jee W., Singh Amik, Vuduc Richard W. ACM SIGPLAN Notices: A Monthly Publication of the Special Interest Group on Programming Languages. 2010, No. 5
2. A model-driven blocking strategy for load balanced sparse matrix-vector multiplication on GPUs [J]. Arash Ashari, Naser Sedaghati, John Eisenlohr. Journal of Parallel and Distributed Computing. 2015
3. Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs [J]. Rafique A., Constantinides G.A., Kapre N. IEEE Transactions on Parallel and Distributed Systems. 2015, No. 1
4. Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs [C]. Jee W. Choi, Amik Singh, Richard W. Vuduc. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2010
5. Autotuning, code generation and optimizing compiler technology for GPUs [D]. Khan, Malik Muhammad Zaki Murtaza. 2012
6. A Fast Spatial Clustering Method for Sparse LiDAR Point Clouds Using GPU Programming [O]. Yifei Tian, Wei Song, Long Chen. 2020
7. Model-driven autotuning of sparse matrix-vector multiply on GPUs [O]. Jee W. Choi, Amik Singh, Richard W. Vuduc. 2010
