Journal: Concurrency and Computation: Practice and Experience

Cache-oblivious matrix algorithms in the age of multicores and many cores

Abstract

This article highlights the issue of upcoming wider single-instruction, multiple-data units as well as steadily increasing core counts on contemporary and future processor architectures. We present the recent port to and latest results of cache-oblivious algorithms and implementations of our TifaMMy code on four architectures: SGI’s UltraViolet distributed shared-memory machine, Intel’s latest x86 architecture code-named Sandy Bridge, AMD’s new Bulldozer architecture, and Intel’s future Many Integrated Core architecture. TifaMMy’s matrix multiplication and LU decomposition routines have been adapted and tuned with regard to these architectures. Results are discussed and compared with vendors’ architecture-specific and optimized libraries, Math Kernel Library and AMD Core Math Library, for both a standard C++ version with vectorization compiler switches and TifaMMy’s highly optimized vector intrinsics version. We provide insights into architectural properties and comment on the feasibility of heterogeneous cores and accelerators, namely graphics processing units. Besides bare-metal performance, the test platforms’ ease of use is analyzed in detail, and the portability of our approach to new and upcoming silicon is discussed with regard to required effort on code change abstraction levels.

As a result, we demonstrate that because of its generic structure in terms of memory organization, TifaMMy executes with equally efficient performance on all four architectures as it automatically adapts itself to architectural parameters without losing performance against the Math Kernel Library and AMD Core Math Library, underlining its generic and cache-oblivious properties, as the porting effort was relatively low compared with that in other implementations.
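
To make the term "cache-oblivious" concrete, the following is a minimal sketch of a generic recursive divide-and-conquer matrix multiplication in standard C++. It is not TifaMMy's code and does not reproduce its memory layout or vector intrinsics; the names `matmul_rec` and `matmul`, the row-major storage, the base-case cutoff of 8, and the restriction to power-of-two sizes are all assumptions made for illustration. The point it shows is that no cache size or block size of the target machine appears anywhere in the code: the recursion keeps halving the problem until the sub-blocks fit into whatever cache levels exist.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only): computes C += A * B on n x n
// sub-blocks of row-major matrices that share the leading dimension ld.
// Assumption: n is a power of two.
void matmul_rec(const double* A, const double* B, double* C,
                std::size_t n, std::size_t ld) {
    // Small base case. The cutoff only bounds recursion overhead;
    // it is not tuned to any particular cache size.
    if (n <= 8) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k)
                for (std::size_t j = 0; j < n; ++j)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }

    const std::size_t h = n / 2;

    // Quadrants of each matrix, addressed in place (no copies).
    const double* A11 = A;          const double* A12 = A + h;
    const double* A21 = A + h * ld; const double* A22 = A + h * ld + h;
    const double* B11 = B;          const double* B12 = B + h;
    const double* B21 = B + h * ld; const double* B22 = B + h * ld + h;
    double* C11 = C;                double* C12 = C + h;
    double* C21 = C + h * ld;       double* C22 = C + h * ld + h;

    // Eight recursive half-size products, two per quadrant of C.
    matmul_rec(A11, B11, C11, h, ld); matmul_rec(A12, B21, C11, h, ld);
    matmul_rec(A11, B12, C12, h, ld); matmul_rec(A12, B22, C12, h, ld);
    matmul_rec(A21, B11, C21, h, ld); matmul_rec(A22, B21, C21, h, ld);
    matmul_rec(A21, B12, C22, h, ld); matmul_rec(A22, B22, C22, h, ld);
}

// Hypothetical convenience wrapper: returns A * B for n x n row-major
// matrices stored in std::vector. Assumption: n is a power of two.
std::vector<double> matmul(const std::vector<double>& A,
                           const std::vector<double>& B, std::size_t n) {
    std::vector<double> C(n * n, 0.0);
    matmul_rec(A.data(), B.data(), C.data(), n, n);
    return C;
}
```

Calling `matmul(A, B, 16)` on two 16 x 16 row-major matrices returns their product. A production implementation of the kind the abstract describes would additionally vectorize the base case and parallelize the independent recursive calls, which this sketch leaves out.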
