Conference: Principles and Practice of Parallel Programming

Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs


Abstract

We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs). Our study consists of two parts.

First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single precision and 15.7 Gflop/s in double precision on the NVIDIA T10P multiprocessor (C1060), improving on prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8× and 1.5× for single and double precision, respectively.

However, achieving this level of performance requires input matrix-dependent parameter tuning. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and run-time estimation, but more directly models the structure of multithreaded vector processors like GPUs. We show that our model can identify the implementations that achieve within 15% of those found through exhaustive search.
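For readers unfamiliar with the storage format named above, the following is a minimal sketch of SpMV over the plain (unblocked) ELLPACK layout, written in pure Python for clarity. It is not the authors' BELLPACK implementation and involves no blocking, GPU code, or tuning; the function names `dense_to_ell` and `ell_spmv` are hypothetical and chosen only for this illustration. The idea it shows is that every row is padded to the same number of stored entries, so each row performs an identical loop, which is what makes the layout attractive for vector-style hardware.

```python
def dense_to_ell(dense):
    """Convert a dense matrix (list of row lists) to ELLPACK arrays.

    Hypothetical helper for illustration only.
    """
    # Width = the largest number of nonzeros in any row.
    width = max((sum(1 for v in row if v != 0) for row in dense), default=0)
    values, cols = [], []
    for row in dense:
        nz = [(j, v) for j, v in enumerate(row) if v != 0]
        pad = width - len(nz)
        # Pad short rows with explicit zeros; column 0 is a harmless
        # placeholder because the padded value is 0.
        cols.append([j for j, _ in nz] + [0] * pad)
        values.append([v for _, v in nz] + [0.0] * pad)
    return values, cols, width


def ell_spmv(values, cols, width, x):
    """Compute y = A * x for A stored in ELLPACK form."""
    y = [0.0] * len(values)
    for i in range(len(values)):   # on a GPU this would map to one thread per row
        acc = 0.0
        for k in range(width):     # every row runs the same number of steps
            acc += values[i][k] * x[cols[i][k]]
        y[i] = acc
    return y


if __name__ == "__main__":
    A = [[4, 0, 0, 1],
         [0, 3, 0, 0],
         [0, 0, 0, 0],
         [2, 0, 5, 6]]
    values, cols, width = dense_to_ell(A)
    print(ell_spmv(values, cols, width, [1.0, 2.0, 3.0, 4.0]))
```

The padding cost grows with the gap between the longest and shortest rows, which hints at why the best storage parameters depend on the input matrix, as the abstract notes.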
