IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Scale-Free Sparse Matrix-Vector Multiplication on Many-Core Architectures

Abstract

Sparse matrix-vector multiplication (SpMV) is one of the most important kernels for many applications. In this paper, we study the implementation of SpMV for scale-free matrices on many-core architectures, including graphics processing units and Xeon Phi coprocessors. We first propose a hardware-oblivious implementation for heterogeneous many-core processors using OpenCL. Our OpenCL implementation uses a novel SpMV format called hybrid COO+CSR (HCC), which employs 2-D jagged partitioning to balance the workload among a large number of cores and improve data locality. Moreover, the OpenCL implementation is designed to be parametric, which allows systematic performance tuning. We conduct experiments to evaluate the efficiency of our hardware-oblivious implementation. The experiments show that it achieves performance comparable to Intel MKL and to the state-of-the-art OpenCL-based ViennaCL library. Although the OpenCL implementation provides functional portability across heterogeneous systems, it fails to take advantage of low-level architectural features. To further improve performance, we propose a hardware-conscious implementation using the native parallel programming language, with the Xeon Phi platform as a case study. In our hardware-conscious implementation, we ensure that the HCC format efficiently utilizes the vector processing units on Xeon Phi by employing low-level intrinsics, and we improve overall performance through locality-aware block mapping and intrablock tiling. Experiments using a wide range of representative scale-free matrices demonstrate that, compared with the OpenCL-based hardware-oblivious implementation, the hardware-conscious implementation achieves a 2.2× speedup on average. Compared with MKL, the hardware-conscious implementation achieves a 3.1× speedup on Xeon Phi.
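
For readers unfamiliar with the two storage formats that HCC combines, the following is a minimal Python sketch of SpMV over plain CSR and plain COO, plus a simple way to split rows so that each worker receives a similar number of nonzeros. It is not the authors' HCC format or their 2-D jagged partitioning, whose layout the abstract does not specify. The function names (spmv_csr, spmv_coo, split_rows_by_nnz) and the 1-D row split are illustrative assumptions only.

```python
from bisect import bisect_left


def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x with A in CSR: row i owns entries row_ptr[i]..row_ptr[i+1]-1."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def spmv_coo(n_rows, rows, cols, vals, x):
    """y = A @ x with A in COO: one (row, col, value) triple per nonzero."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


def split_rows_by_nnz(row_ptr, n_parts):
    """Cut rows into n_parts contiguous ranges with roughly equal nonzeros.

    Assumption: a 1-D simplification for illustration, not the paper's
    2-D jagged partitioning.
    """
    n_rows = len(row_ptr) - 1
    nnz = row_ptr[-1]
    cuts = [0]
    for p in range(1, n_parts):
        target = p * nnz / n_parts
        # First row boundary whose prefix nonzero count reaches the target.
        cut = bisect_left(row_ptr, target)
        cuts.append(min(max(cut, cuts[-1]), n_rows))
    cuts.append(n_rows)
    return list(zip(cuts[:-1], cuts[1:]))


# A tiny "scale-free-like" matrix: row 0 is a hub with 4 nonzeros,
# every other row has a single nonzero.
row_ptr = [0, 4, 5, 6, 7]
col_idx = [0, 1, 2, 3, 1, 2, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
coo_rows = [0, 0, 0, 0, 1, 2, 3]
x = [1.0, 1.0, 1.0, 1.0]

y_csr = spmv_csr(row_ptr, col_idx, vals, x)
y_coo = spmv_coo(4, coo_rows, col_idx, vals, x)
parts = split_rows_by_nnz(row_ptr, 2)
```

In this toy matrix an even split by row count would give one worker 5 nonzeros and the other 2, whereas the nonzero-based split assigns the hub row to one worker and the three light rows to the other (4 and 3 nonzeros). Skewed row lengths of this kind are the load-balancing problem that scale-free matrices pose for a row-per-thread CSR kernel.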
