IEEE International Symposium on High Performance Computer Architecture (HPCA)

OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator

Abstract

Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm².
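
The outer-product idea mentioned in the abstract, with multiplication decoupled from accumulation, can be illustrated in software. What follows is a minimal Python sketch and not the authors' hardware design. The function name `outer_product_spgemm`, the dictionary-based storage and the tiny 2×2 example are all assumptions made for illustration. Column k of A is paired with row k of B, so each non-zero is read once in the multiply phase. The partial products are summed afterwards in a separate merge phase.

```python
from collections import defaultdict


def outer_product_spgemm(a_cols, b_rows):
    """Compute C = A * B for sparse A and B (illustrative sketch).

    a_cols[k] holds the non-zeros of column k of A as (row_index, value).
    b_rows[k] holds the non-zeros of row k of B as (col_index, value).
    Returns C as {row_index: {col_index: value}}.
    """
    # Multiply phase: the outer product of column k of A and row k of B
    # yields partial products, grouped here by output row. Nothing is
    # accumulated yet.
    partials = defaultdict(list)
    for k, col in a_cols.items():
        row = b_rows.get(k, [])
        for i, a_val in col:
            for j, b_val in row:
                partials[i].append((j, a_val * b_val))

    # Merge phase: sum the partial products that share an output position.
    result = {}
    for i, products in partials.items():
        acc = defaultdict(float)
        for j, p in products:
            acc[j] += p
        result[i] = dict(acc)
    return result


# Hypothetical example: A = [[1, 0], [2, 3]], B = [[0, 4], [5, 0]]
a_cols = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 3.0)]}
b_rows = {0: [(1, 4.0)], 1: [(0, 5.0)]}
print(outer_product_spgemm(a_cols, b_rows))
```

For this example the dense product is [[0, 4], [15, 8]], so the result contains only the three non-zero entries. The sketch deliberately keeps the two phases as separate loops to mirror the decoupling described above. It makes no claim about how OuterSPACE schedules or stores the partial products.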