OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator

Abstract

Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm².
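
The outer product formulation mentioned in the abstract can be illustrated in software. The following is a minimal sketch, not the OuterSPACE hardware design: it assumes A is stored by columns and B by rows as plain Python dictionaries, and the function name `outer_product_spgemm` and the storage layout are hypothetical choices made for this example. It shows the two decoupled phases: a multiply phase that reads each non-zero of column k of A and row k of B once, and a merge phase that accumulates the resulting partial products.

```python
from collections import defaultdict


def outer_product_spgemm(a_cols, b_rows):
    """Sparse C = A * B via outer products (illustrative sketch).

    a_cols[k] -- list of (i, value) non-zeros in column k of A (assumed layout)
    b_rows[k] -- list of (j, value) non-zeros in row k of B (assumed layout)
    Returns C as {i: {j: value}}.
    """
    # Multiply phase: column k of A times row k of B yields one outer product.
    # Partial products are only stored here, not accumulated.
    partials = defaultdict(list)  # row index i -> list of (j, product)
    for k in a_cols.keys() & b_rows.keys():
        for i, a in a_cols[k]:
            for j, b in b_rows[k]:
                partials[i].append((j, a * b))

    # Merge phase: sum partial products that land on the same (i, j).
    c = {}
    for i, items in partials.items():
        row = defaultdict(float)
        for j, p in items:
            row[j] += p
        c[i] = dict(row)
    return c


# A = [[1, 0], [2, 3]] stored by columns; B = [[0, 4], [5, 0]] stored by rows.
a_cols = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 3.0)]}
b_rows = {0: [(1, 4.0)], 1: [(0, 5.0)]}
c = outer_product_spgemm(a_cols, b_rows)
```

Because the multiply phase never looks up an existing entry of C, the accumulation cost is deferred entirely to the merge phase, which is the decoupling the abstract describes.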

Bibliographic details

Source
IEEE International Symposium on High Performance Computer Architecture | 2018 | pp. 724-736 | 13 pages
Authors
Subhankar Pal; Jonathan Beaumont; Dong-Hyeon Park; Aporva Amarnath; Siying Feng; Chaitali Chakrabarti; Hun-Seok Kim; David Blaauw; Trevor Mudge; Ronald Dreslinski;
Original format: PDF
Keywords
Sparse matrices; Computer architecture; Kernel; Matrix decomposition; Parallel processing; Graphics processing units; Libraries;

Similar literature

1. Simultaneous Input and Output Matrix Partitioning for Outer-Product-Parallel Sparse Matrix-Matrix Multiplication [J]. Akbudak Kadir, Aykanat Cevdet. SIAM Journal on Scientific Computing. 2014, No. 5
2. A 7.3 M Output Non-Zeros/J, 11.7 M Output Non-Zeros/GB Reconfigurable Sparse Matrix–Matrix Multiplication Accelerator [J]. Park Dong-Hyeon, Pal Subhankar, Peng Siying, IEEE Journal of Solid-State Circuits. 2020, No. 4
3. SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator [J]. Wu Di, Fan Xitian, Cao Wei, IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 2021, No. 5
4. OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator [C]. Subhankar Pal, Jonathan Beaumont, Dong-Hyeon Park, IEEE International Symposium on High Performance Computer Architecture. 2018
5. Fast space-varying convolution in stray light reduction, fast matrix vector multiplication using the sparse matrix transform, and activation detection in fMRI data analysis [D]. Wei, Jianing. 2010
6. Computing the sparse matrix vector product using block-based kernels without zero padding on processors with AVX-512 instructions [O]. Bérenger Bramas, Pavel Kus. 2018
7. Simultaneous input and output matrix partitioning for outer-product-parallel sparse matrix-matrix multiplication [O]. Akbudak, K., Aykanat, C. 2014
