
Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors



Abstract

Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks in memory management and thread scheduling on Intel Xeon Phi (Knights Landing, or KNL). Specifically targeting many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Unlike prior evaluations in the literature, ours also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search and triangle counting. Our hash-table- and heap-based algorithms achieve significant speedups over existing libraries in the majority of cases, while different algorithms dominate the remaining scenarios depending on matrix size, sparsity, compression factor, and operation type. We distill these in-depth evaluation results into a recipe for selecting the best SpGEMM algorithm for a target scenario, and we build performance models for the hash-table- and heap-based algorithms that support the recipe. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix. Finally, we integrate our implementations into a large-scale protein clustering code named HipMCL, accelerating its SpGEMM kernel by up to 10X and boosting the overall performance of the whole HipMCL application by 2.6X. (C) 2019 Elsevier B.V. All rights reserved.
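To make the hash-table approach concrete, below is a minimal sketch (not the paper's optimized implementation) of row-wise SpGEMM with a per-row hash-table accumulator, operating on CSR triples `(indptr, indices, data)`. The `sort_rows` flag illustrates the abstract's key finding: skipping the per-row sort of output column indices removes work when sorted rows are not required. Function and variable names here are illustrative assumptions.

```python
# Sketch of row-wise SpGEMM (Gustavson's scheme) with a hash-table
# accumulator per output row. Inputs a and b are CSR triples:
# (indptr, indices, data). Returns the CSR triple of C = A * B.
def spgemm_hash(a, b, sort_rows=True):
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    c_ptr, c_idx, c_val = [0], [], []
    n_rows = len(a_ptr) - 1
    for i in range(n_rows):
        acc = {}  # hash table: output column index -> accumulated value
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j, av = a_idx[k], a_val[k]
            # Scatter row j of B, scaled by A[i, j], into the accumulator.
            for l in range(b_ptr[j], b_ptr[j + 1]):
                col = b_idx[l]
                acc[col] = acc.get(col, 0.0) + av * b_val[l]
        # Skipping this sort (sort_rows=False) is what yields the
        # reported speedup when sorted output rows are not needed.
        cols = sorted(acc) if sort_rows else list(acc)
        c_idx.extend(cols)
        c_val.extend(acc[c] for c in cols)
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```

In the many-core setting the paper targets, each thread would process a disjoint set of rows with its own accumulator; this serial sketch only shows the per-row data flow.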

Bibliographic record

  • Source
    Parallel Computing | 2019, Issue 12 | pp. 102545.1-102545.13 | 13 pages
  • Author affiliations

    Tokyo Inst Technol, Tokyo, Japan;

    RIKEN Ctr Computat Sci, Kobe, Hyogo, Japan | Tokyo Inst Technol, Dept Math & Comp Sci, Tokyo, Japan;

    Indiana Univ, Bloomington, IN, USA;

    Lawrence Berkeley Natl Lab, Berkeley, CA, USA | Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA, USA;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    Sparse matrix; SpGEMM; Intel KNL;


