Journal: Concurrency and Computation

A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels


Abstract

Despite extensive research, optimal performance for matrix multiplication (especially for large matrices) has not been readily achievable on most architectures, because of the lack of a structured approach and the limitations imposed by matrix storage formats. This paper presents a simple but effective framework that lays the foundation for building high-performance matrix-multiplication codes in a structured, portable and efficient manner. The resulting codes are validated on three representative RISC and CISC architectures, on which they significantly outperform highly optimized libraries such as ATLAS and other competing methodologies reported in the literature. The main component of the proposed approach is a hierarchical storage format that efficiently extends the memory-hierarchy-friendly Morton ordering to arbitrary-sized matrices. The storage format supports polyalgorithms, which are shown here to be essential for obtaining the best possible performance over a range of problem sizes. The paper makes several algorithmic advances, including an oscillating iterative algorithm for matrix multiplication and a variable recursion cutoff criterion for Strassen's algorithm. The authors also expose the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high-performance code. These kernel routines operate on small blocks that fit in the L1 cache. The performance advantages of the proposed framework can be delivered to new and existing applications through object-oriented or compiler-based approaches.
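The abstract does not spell out the hierarchical storage format, so the following is only a minimal sketch of the plain Morton (Z-order) indexing that it builds on, not the authors' format. The function names `morton_index` and `morton_tile_order` and the `bits` parameter are hypothetical, chosen for illustration. The idea is that tiles close together in the two-dimensional grid tend to land close together in the one-dimensional storage order, which is what makes the ordering friendly to the memory hierarchy.

```python
def morton_index(row: int, col: int, bits: int = 16) -> int:
    """Interleave the bits of (row, col) into a single Z-order index.

    Column bits go to even bit positions, row bits to odd positions.
    `bits` is an assumed upper bound on the bit width of row and col.
    """
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)
        z |= ((row >> b) & 1) << (2 * b + 1)
    return z


def morton_tile_order(n_tiles: int) -> list:
    """List the (row, col) tiles of an n_tiles x n_tiles grid in Morton order.

    Sorting by the interleaved index also works when n_tiles is not a
    power of two, but the resulting indices are then not contiguous;
    handling that efficiently is the kind of problem a hierarchical
    format would have to address.
    """
    coords = [(r, c) for r in range(n_tiles) for c in range(n_tiles)]
    return sorted(coords, key=lambda rc: morton_index(rc[0], rc[1]))


if __name__ == "__main__":
    # Print the tile visiting order for a 4 x 4 grid of tiles.
    for tile in morton_tile_order(4):
        print(tile, morton_index(*tile))
```

In a tiled layout of this kind, each tile would itself be a small dense block sized to fit in the L1 cache, and a low-level kernel would multiply one pair of tiles at a time; the sketch covers only the ordering of tiles, not the kernels.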
