Journal: Parallel Computing

Model-driven adaptation of double-precision matrix multiplication to the Cell processor architecture

Abstract

The main contribution of this paper is a model-driven approach to adapting double-precision matrix multiplication to the architectures of blade systems based on two types of Cell processors. The hierarchical algorithm used for the adaptation consists of four levels. The first level distributes the computation among all 16 SPE cores of the IBM BladeCenter QS21 or QS22. The second level corresponds to a macro-kernel and is responsible for data management in the main memory, as well as for communication between the main memory and the local stores of the SPE cores. Each macro-kernel operation is implemented within the local store of an SPE core. The third level corresponds to a kernel of the algorithm; each kernel operation is implemented on a single SPE within its local store as a sequence of micro-kernel operations. The fourth level is a micro-kernel implemented within the register file of an SPE core.

The proposed approach is based on two performance models. The purpose of the first model is to optimize communication across all 16 SPE cores of the IBM BladeCenter, including the main memory and the local stores of the SPEs. It is constructed as a function of the size of the matrix blocks. This model allows for selecting "the best" size of the macro-kernel. The second performance model aims at optimizing computations within a single SPE core, taking into account constraints on traffic between the local store and the register file of the SPE. The model accounts for such factors as the size of the local store, the number of registers, the properties of double-precision operations, the balance between pipelines, etc. This model allows for selecting "the best" size of the kernel and micro-kernel operations.

The model-driven adaptation is followed by a series of systematic optimization steps. They include loop unrolling, double buffering at the register and memory levels, and use of the NUMA library.

The proposed adaptation and optimization steps are implemented entirely in the C language, without optimizing code manually. For the IBM QS21 system, which uses two first-generation Cell processors, this implementation achieves 27.24 Gflop/s, which is 93.1% of the peak performance. This result is obtained for matrices of size 4096 by 4096. For the IBM QS22 system, based on PowerXCell 8i processors, the performance of double-precision arithmetic is much higher, so 184.4 Gflop/s is achieved, which is 90.0% of the peak performance. This result is reported for matrices of size 15,872 by 15,872. The overall performance could be slightly improved by substituting the macro-kernel developed in this work with the highly optimized Cell BLAS dgemm_64x64 kernel.
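The four-level hierarchy described in the abstract can be illustrated with a small, portable C sketch. It is not the authors' implementation: the function names (dgemm_hier, macro_kernel, kernel, micro_kernel) and the sizes NB, MR, NR and WORKERS are hypothetical and chosen only for readability. A sequential loop stands in for the 16 SPE cores, memcpy into stack buffers stands in for transfers between main memory and a local store, and local scalars stand in for the register file. Double buffering, loop unrolling and NUMA placement are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sizes only (assumptions, not values from the paper). */
enum { NB = 8, MR = 2, NR = 2, WORKERS = 4 };

/* Level 4: micro-kernel. Accumulates an MR x NR tile of the local C block
   in local scalars, which stand in for the register file. */
static void micro_kernel(const double *a, const double *b, double *c)
{
    double acc[MR][NR] = {{0.0}};
    for (size_t k = 0; k < NB; ++k)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += a[i * NB + k] * b[k * NB + j];
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * NB + j] += acc[i][j];
}

/* Level 3: kernel. Multiplies two NB x NB blocks held in local buffers
   as a sequence of micro-kernel calls. */
static void kernel(const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < NB; i += MR)
        for (size_t j = 0; j < NB; j += NR)
            micro_kernel(a + i * NB, b + j, c + i * NB + j);
}

/* Copy one NB x NB block between a matrix with leading dimension ld and a
   contiguous local buffer (a stand-in for a main-memory/local-store transfer). */
static void copy_in(double *dst, const double *src, size_t ld)
{
    for (size_t r = 0; r < NB; ++r)
        memcpy(dst + r * NB, src + r * ld, NB * sizeof(double));
}

static void copy_out(double *dst, size_t ld, const double *src)
{
    for (size_t r = 0; r < NB; ++r)
        memcpy(dst + r * ld, src + r * NB, NB * sizeof(double));
}

/* Level 2: macro-kernel. Manages block movement and updates block (bi, bj) of C. */
static void macro_kernel(size_t n, const double *A, const double *B,
                         double *C, size_t bi, size_t bj)
{
    double la[NB * NB], lb[NB * NB], lc[NB * NB];
    copy_in(lc, C + bi * NB * n + bj * NB, n);
    for (size_t bk = 0; bk < n / NB; ++bk) {
        copy_in(la, A + bi * NB * n + bk * NB, n);
        copy_in(lb, B + bk * NB * n + bj * NB, n);
        kernel(la, lb, lc);
    }
    copy_out(C + bi * NB * n + bj * NB, n, lc);
}

/* Level 1: distribute block rows of C among workers. Computes C += A * B for
   row-major n x n matrices; each w iteration stands in for one core. */
void dgemm_hier(size_t n, const double *A, const double *B, double *C)
{
    assert(n % NB == 0);
    size_t nblk = n / NB;
    for (size_t w = 0; w < WORKERS; ++w)
        for (size_t bi = w; bi < nblk; bi += WORKERS)
            for (size_t bj = 0; bj < nblk; ++bj)
                macro_kernel(n, A, B, C, bi, bj);
}
```

Calling dgemm_hier(n, A, B, C) on row-major arrays whose dimension n is a multiple of NB adds the product A * B to C. In the setting the abstract describes, the block sizes would be selected by the two performance models rather than fixed by hand as they are here.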