Scale-Free Sparse Matrix-Vector Multiplication on Many-Core Architectures

Yun Liang; Wai Teng Tang; Ruizhe Zhao; Mian Lu; Huynh Phung Huynh; Rick Siow Mong Goh

Abstract

Sparse matrix-vector multiplication (SpMV) is one of the most important kernels for many applications. In this paper, we study the implementation of SpMV for scale-free matrices on many-core architectures, including graphics processing units (GPUs) and Xeon Phi coprocessors. We first propose a hardware-oblivious implementation for heterogeneous many-core processors using OpenCL. Our OpenCL implementation uses a novel SpMV format called hybrid COO+CSR (HCC), which employs 2-D jagged partitioning to balance the workload among a large number of cores and improve data locality. Moreover, the OpenCL implementation is designed to be parametric, which allows systematic performance tuning. We conduct experiments to evaluate the efficiency of our hardware-oblivious implementation. The experiments show that it achieves performance comparable to Intel MKL and to the state-of-the-art OpenCL-based ViennaCL library. Although the OpenCL implementation provides functional portability across heterogeneous systems, it fails to take advantage of low-level architectural features. To further improve performance, we propose a hardware-conscious implementation using the native parallel programming language. We use the Xeon Phi platform as a case study. In our hardware-conscious implementation, we ensure that the HCC format efficiently utilizes the vector processing units on Xeon Phi by employing low-level intrinsics, and we improve overall performance through locality-aware block mapping and intrablock tiling. Experiments using a wide range of representative scale-free matrices demonstrate that, compared with the OpenCL-based hardware-oblivious implementation, the hardware-conscious implementation achieves a 2.2× speedup on average. Compared with MKL, the hardware-conscious implementation achieves a 3.1× speedup on Xeon Phi.
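For readers unfamiliar with the two storage schemes that the HCC name refers to, the following is a minimal sketch of SpMV over plain CSR and plain COO. It is not the authors' HCC format or their partitioning scheme, which the abstract does not specify in detail; the function names, the 3×3 example matrix, and the constants are assumptions made purely for illustration.

```python
# Illustrative only: textbook CSR and COO SpMV, not the paper's HCC format.

def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for A in CSR form.

    Row i owns the nonzeros vals[row_ptr[i]:row_ptr[i + 1]].
    """
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def spmv_coo(n_rows, rows, cols, vals, x):
    """Compute y = A @ x for A in COO form (one (row, col, value) per nonzero)."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


# Hypothetical 3x3 matrix whose first row is much denser than the others,
# a tiny stand-in for the skewed row lengths of scale-free matrices:
# [[1, 2, 3],
#  [0, 4, 0],
#  [0, 0, 5]]
ROW_PTR = [0, 3, 4, 5]           # CSR row offsets
COL_IDX = [0, 1, 2, 1, 2]        # column of each nonzero (shared by both formats)
VALS = [1.0, 2.0, 3.0, 4.0, 5.0]
COO_ROWS = [0, 0, 0, 1, 2]       # explicit row of each nonzero, used by COO only
X = [1.0, 1.0, 1.0]

Y_CSR = spmv_csr(ROW_PTR, COL_IDX, VALS, X)
Y_COO = spmv_coo(3, COO_ROWS, COL_IDX, VALS, X)
```

In the CSR loop the work per row equals the row length, so a few very long rows can dominate if rows are assigned to cores one by one. COO iterates over nonzeros instead, which makes the work easier to split evenly. This difference is a common motivation for hybrid formats, though how HCC combines the two is described in the paper itself rather than here.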

Bibliographic details

Source
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017, no. 12, pp. 2106-2119 (14 pages)
Authors
Yun Liang; Wai Teng Tang; Ruizhe Zhao; Mian Lu; Huynh Phung Huynh; Rick Siow Mong Goh
Original format: PDF
Language: English
Keywords
Computer architecture; Sparse matrices; Graphics processing units; Programming; Instruction sets; Optimization

Similar literature


1. Spatiotemporal Graph and Hypergraph Partitioning Models for Sparse Matrix-Vector Multiplication on Many-Core Architectures [J]. Abubaker Nabil, Akbudak Kadir, Aykanat Cevdet. IEEE Transactions on Parallel and Distributed Systems, 2019, no. 2.
2. Optimizing Sparse Matrix-Vector Multiplications on an ARMv8-based Many-Core Architecture [J]. Chen Donglin, Fang Jianbin, Chen Shizhao. International Journal of Parallel Programming, 2019, no. 3.
3. Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Processors [J]. M. Ozan Karsavuran, Kadir Akbudak, Cevdet Aykanat. IEEE Transactions on Parallel and Distributed Systems, 2016, no. 6.
4. Research on Performance Optimization for Sparse Matrix-Vector Multiplication in Multi/Many-core Architecture [C]. Qihan Wang, Mingliang Li, Jianming Pang. International Conference on Information Technology and Computer Application, 2020.
5. Analysis of High Performance Sparse Matrix-Vector Multiplication for Small Finite Fields [D]. Lambert, Matthew A. 2020.
6. Hierarchical Orthogonal Matrix Generation and Matrix-Vector Multiplications in Rigid Body Simulations [O]. Fuhui Fang, Jingfang Huang, Gary Huber.
7. Adaptive Optimization of Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures [O]. Shizhao Chen, Jianbin Fang, Donglin Chen. 2018.
