Effective Implementation of Matrix-Vector Multiplication on Intel's AVX multicore Processor

Hassan Somaia A.; Mahmoud Mountasser M. M.; Hemeida A. M.; Saber Mahmoud A.

Abstract

The matrix-vector multiplication kernel is one of the most important and common computational operations, forming the core of many application areas such as scientific and engineering computing. It is therefore important to optimize and accelerate its implementation. This paper proposes an optimized algorithm for single-precision matrix-vector multiplication (SGEMV) on the Intel Core i7 processor. It gives a comprehensive overview of the use of Intel's Advanced Vector Extensions (AVX) instructions in implementing dense matrix-vector multiplication kernels in parallel. It also designs a variety of performance optimization techniques that use the AVX instruction set, memory access optimization, and OpenMP parallelization. The performance of the proposed algorithm is compared with the SGEMV subroutines of the latest version of the Intel Math Kernel Library (2017), because the Intel Math Kernel Library subroutines also apply the optimization methods used in this paper. The paper introduces the optimization techniques, explains the specific details of handling them in the proposed algorithm, and shows the advantages and challenges of combining them, in contrast to previous works, which have usually concentrated on a single technique and the performance it achieves. The guidelines for the parallel implementation of the proposed algorithm, and the characteristics of the target architecture that should be considered when implementing it, are investigated. A comparative study of the two most widely used C++ compilers is presented: the Intel C++ compiler 17.0 in Intel Parallel Studio XE 2017 and the Microsoft Visual Studio C++ compiler 2015.
Finally, the two primary ways of using AVX instructions, inline assembly and intrinsic functions, are compared, as are single-core and multi-core platforms. The results are evaluated on an Intel Core i7-5600U (Broadwell) processor running at 2.6 GHz with 128 KB of L1 cache, 512 KB of L2 cache, and 4 MB of L3 cache, under the Windows 10 operating system. The proposed optimized algorithm is tested on large square matrices with sizes ranging from 1024 to 19456. The results indicate a performance improvement of 18.2% and 14.1% for y = A·x and y = A^T·x, respectively, compared with the results obtained using the SGEMV subroutines of the latest version of the Intel Math Kernel Library 2017 on a multi-core platform. (C) 2017 Elsevier Ltd. All rights reserved.
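
As a rough illustration of how two of the techniques named in the abstract (AVX intrinsic functions and OpenMP parallelization) might be combined for y = A·x, here is a minimal C sketch. It is not the authors' algorithm: the function name sgemv_rowmajor, the row-major storage, the unaligned loads, and the static schedule are all assumptions made for this example. When the compiler is not targeting AVX, the code falls back to a plain scalar loop.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>   /* AVX intrinsics: __m256, _mm256_* */
#endif

/* Hypothetical helper (not from the paper): y = A * x, where A is a
 * row-major n_rows x n_cols single-precision matrix. */
void sgemv_rowmajor(const float *A, const float *x, float *y,
                    size_t n_rows, size_t n_cols)
{
    /* Each row's dot product is independent, so rows are split across
     * threads. The pragma is ignored when OpenMP is not enabled.
     * A signed loop index is used for older OpenMP implementations. */
    #pragma omp parallel for schedule(static)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n_rows; ++i) {
        const float *row = A + (size_t)i * n_cols;
        size_t j = 0;
        float sum = 0.0f;
#if defined(__AVX__)
        /* Process 8 floats per iteration in one 256-bit register. */
        __m256 acc = _mm256_setzero_ps();
        for (; j + 8 <= n_cols; j += 8) {
            __m256 a = _mm256_loadu_ps(row + j);  /* unaligned load: assumption */
            __m256 v = _mm256_loadu_ps(x + j);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(a, v));
        }
        /* Horizontal reduction of the 8 partial sums. */
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        for (int k = 0; k < 8; ++k) {
            sum += lanes[k];
        }
#endif
        /* Scalar tail (or the whole row when AVX is unavailable). */
        for (; j < n_cols; ++j) {
            sum += row[j] * x[j];
        }
        y[i] = sum;
    }
}
```

With GCC, for example, the AVX path and the threading are enabled by compiling with -mavx and -fopenmp. Because the vector path adds the products in a different order than a scalar loop, results can differ from a reference SGEMV in the last few bits for general inputs.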

Bibliographic details

Source
Computer Languages, Systems & Structures, 2018, Issue 1, pp. 158-175 (18 pages)
Authors
Hassan Somaia A.; Mahmoud Mountasser M. M.; Hemeida A. M.; Saber Mahmoud A.
Author affiliations

Aswan Univ, Elect Engn Dept, Comp & Syst Sect, Aswan 81542, Egypt;

Aswan Univ, Elect Engn Dept, Comp & Syst Sect, Aswan 81542, Egypt;

Aswan Univ, Fac Energy Engn, Elect Engn Dept, Aswan 81825, Egypt;

Aswan Univ, Elect Engn Dept, Comp & Syst Sect, Aswan 81542, Egypt;

Original format: PDF
Language: English
Keywords
Intel's AVX; Matrix-vector multiplications; Intel MKL SGEMV; Performance optimization; Multicore; Intrinsic functions; Inline assembly; Intel C++ compiler; Microsoft VC++ compiler

Similar literature

1. An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512 [J]. Roktaek Lim, Yeongha Lee, Raehyun Kim. Cluster Computing, 2018, Issue 4.
2. A Novel Parallel Scan for Multicore Processors and Its Application in Sparse Matrix-Vector Multiplication [J]. Zhang Nan. IEEE Transactions on Parallel and Distributed Systems, 2012, Issue 3.
3. An effective implementation of Strassen's algorithm using AVX intrinsics for a multicore architecture [J]. Nwe Zin Oo, Panyayot Chaikan. Songklanakarin Journal of Science and Technology, 2020, Issue 6.
4. An Effective Approach for Implementing Sparse Matrix-Vector Multiplication on Graphics Processing Units [C]. Abu-Sufah Walid, Karim Asma Abdel. The 14th IEEE International Conference on High Performance Computing and Communication; The 9th IEEE International Conference on Embedded Software and Systems, 2012.
5. A Scalable and Flexible Framework for Gaussian Processes via Matrix-Vector Multiplication [D]. Pleiss, Geoff. 2020.
6. Hierarchical Orthogonal Matrix Generation and Matrix-Vector Multiplications in Rigid Body Simulations [O]. Fuhui Fang, Jingfang Huang, Gary Huber.
7. Application of Analytical Modeling of Matrix-Vector Multiplication on Multicore Processors [O]. 2020.
