Minimal Data Copy for Dense Linear Algebra Factorization

Abstract

The full format data structures of Dense Linear Algebra hurt the performance of its factorization algorithms. Full format rectangular matrices are the input and output of the Level 3 BLAS. It follows that the LAPACK and Level 3 BLAS approach has a basic performance flaw. We describe a new result that shows that representing a matrix A as a collection of square blocks will reduce the amount of data reformatting required by dense linear algebra factorization algorithms from O(n^3) to O(n^2). On an IBM Power3 processor our implementation of Cholesky factorization achieves 92% of peak performance, whereas conventional full format LAPACK dpotrf achieves 77% of peak performance. All programming for our new data structures may be accomplished in standard Fortran, through the use of higher dimensional full format arrays. Thus, new compiler support may not be necessary. We also discuss the role of concatenating submatrices to facilitate hardware streaming. Finally, we discuss a new concept that we call the L1 / L0 cache interface.
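To make the square-block idea concrete, here is a minimal sketch of one way a full format matrix could be copied into square blocks. It is an illustration only, not the paper's implementation: the paper works in Fortran with higher dimensional arrays, whereas this uses plain Python lists, and the names `to_square_blocks`, `from_square_blocks`, `n`, and `nb` are hypothetical. The sketch assumes a square n x n matrix stored column-major and a block size `nb` that divides `n`.

```python
def to_square_blocks(a, n, nb):
    """Copy an n x n column-major matrix (flat list, leading dimension n)
    into square-block format. blocks[bj][bi] is a contiguous list of
    nb*nb values holding block (bi, bj) in column-major order.
    Assumption of this sketch: nb divides n exactly."""
    if n % nb != 0:
        raise ValueError("this sketch assumes nb divides n")
    nblk = n // nb
    blocks = [[[0.0] * (nb * nb) for _ in range(nblk)] for _ in range(nblk)]
    for j in range(n):                # global column index
        bj, jj = divmod(j, nb)        # block column, column inside the block
        for i in range(n):            # global row index
            bi, ii = divmod(i, nb)    # block row, row inside the block
            blocks[bj][bi][ii + jj * nb] = a[i + j * n]
    return blocks


def from_square_blocks(blocks, n, nb):
    """Inverse copy: rebuild the flat column-major matrix from the blocks."""
    nblk = n // nb
    a = [0.0] * (n * n)
    for bj in range(nblk):
        for bi in range(nblk):
            blk = blocks[bj][bi]
            for jj in range(nb):
                for ii in range(nb):
                    a[(bi * nb + ii) + (bj * nb + jj) * n] = blk[ii + jj * nb]
    return a


if __name__ == "__main__":
    n, nb = 4, 2
    a = [float(k) for k in range(n * n)]   # a[i + j*n] = i + 4*j
    blocks = to_square_blocks(a, n, nb)
    print(blocks[0][1])                    # block in block row 1, block column 0
    print(from_square_blocks(blocks, n, nb) == a)
```

Each conversion touches every matrix element exactly once, so the copy costs n^2 element moves. Once the data is in this layout, each block is contiguous in memory and can be handed to a kernel without further copying, which is the property the abstract relies on.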
