Cache-oblivious matrix algorithms in the age of multicores and many cores

Alexander Heinecke; Carsten Trinitis

Abstract

This article highlights the issue of upcoming wider single-instruction, multiple-data units as well as steadily increasing core counts on contemporary and future processor architectures. We present the recent port to and latest results of cache-oblivious algorithms and implementations of our TifaMMy code on four architectures: SGI’s UltraViolet distributed shared-memory machine, Intel’s latest x86 architecture code-named Sandy Bridge, AMD’s new Bulldozer architecture, and Intel’s future Many Integrated Core architecture. TifaMMy’s matrix multiplication and LU decomposition routines have been adapted and tuned with regard to these architectures. Results are discussed and compared with vendors’ architecture-specific and optimized libraries, Math Kernel Library and AMD Core Math Library, for both a standard C++ version with vectorization compiler switches and TifaMMy’s highly optimized vector intrinsics version. We provide insights into architectural properties and comment on the feasibility of heterogeneous cores and accelerators, namely graphics processing units. Besides bare-metal performance, the test platforms’ ease of use is analyzed in detail, and the portability of our approach to new and upcoming silicon is discussed with regard to required effort on code change abstraction levels.

As a result, we demonstrate that because of its generic structure in terms of memory organization, TifaMMy executes with equally efficient performance on all four architectures as it automatically adapts itself to architectural parameters without losing performance against the Math Kernel Library and AMD Core Math Library, underlining its generic and cache-oblivious properties, as the porting effort was relatively low compared with that in other implementations.
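For readers unfamiliar with the term, the block-recursive, cache-oblivious idea behind matrix multiplication can be illustrated with a generic divide-and-conquer sketch. This is not TifaMMy's code or its memory layout, which the abstract does not describe; the function names `matmul_recursive` and `matmul` and the `base` cutoff are assumptions made for this illustration only. The recursion halves the largest of the three dimensions until the sub-problem is small, so sub-blocks eventually fit into every cache level without the code knowing any cache size.

```python
def matmul_recursive(A, B, C, i0, i1, j0, j1, k0, k1, base=16):
    """Accumulate A[i0:i1, k0:k1] * B[k0:k1, j0:j1] into C[i0:i1, j0:j1].

    Illustrative sketch only: `base` is a recursion cutoff chosen for
    convenience, not a parameter tuned to any cache size.
    """
    di, dj, dk = i1 - i0, j1 - j0, k1 - k0
    if max(di, dj, dk) <= base:
        # Small block: plain triple loop (i-k-j order reads B and C row by row).
        for i in range(i0, i1):
            for k in range(k0, k1):
                a = A[i][k]
                for j in range(j0, j1):
                    C[i][j] += a * B[k][j]
        return
    # Split the largest dimension in half and recurse on both halves.
    if di >= dj and di >= dk:
        mid = i0 + di // 2
        matmul_recursive(A, B, C, i0, mid, j0, j1, k0, k1, base)
        matmul_recursive(A, B, C, mid, i1, j0, j1, k0, k1, base)
    elif dj >= dk:
        mid = j0 + dj // 2
        matmul_recursive(A, B, C, i0, i1, j0, mid, k0, k1, base)
        matmul_recursive(A, B, C, i0, i1, mid, j1, k0, k1, base)
    else:
        mid = k0 + dk // 2
        matmul_recursive(A, B, C, i0, i1, j0, j1, k0, mid, base)
        matmul_recursive(A, B, C, i0, i1, j0, j1, mid, k1, base)


def matmul(A, B, base=16):
    """Return A * B for list-of-lists matrices using the recursive scheme."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    matmul_recursive(A, B, C, 0, n, 0, p, 0, m, base)
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(matmul(A, B, base=1))
```

Because no block size is derived from hardware parameters, the same recursion applies unchanged on machines with different cache hierarchies, which is the portability property the abstract emphasizes. A production implementation would replace the inner triple loop with a vectorized kernel.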


Bibliographic record

Source
Concurrency and computation: practice and experience | 2015, issue 9 | pp. 2215-2234 | 20 pages
Authors
Alexander Heinecke; Carsten Trinitis
Affiliations

Institut für Informatik, Technische Universität München, D-85748 Garching bei München, Germany;

Institut für Informatik, Technische Universität München, D-85748 Garching bei München, Germany;

Original format
PDF
Language
English
Keywords
shared-memory platforms; cache oblivious; block recursive; linear algebra; performance; parallelization

