Model-driven adaptation of double-precision matrix multiplication to the Cell processor architecture

Roman Wyrzykowski; Krzysztof Rojek; Lukasz Szustak


Abstract

The main contribution of this paper is a model-driven approach to adapting double-precision matrix multiplication to the architectures of blade systems based on two types of Cell processors. The hierarchical algorithm used for the adaptation consists of four levels. The first level distributes the computation among all 16 SPE cores of the IBM BladeCenter QS21 or QS22. The second level corresponds to a macro-kernel and is responsible for data management in the main memory, as well as for communication between the main memory and the local stores of the SPE cores. Each macro-kernel operation is implemented within the local store of an SPE core. The third level corresponds to a kernel of the algorithm; each kernel operation is implemented on a single SPE within its local store as a sequence of micro-kernel operations. The fourth level is a micro-kernel implemented within the register file of an SPE core.

The proposed approach is based on two performance models. The first model targets the optimization of communication across all 16 SPE cores of the IBM BladeCenter, including the main memory and the local stores of the SPEs. It is constructed as a function of the size of the matrix blocks and allows for selecting "the best" size of the macro-kernel. The second performance model targets the optimization of computations within a single SPE core, taking into account constraints on traffic between the local store and the register file of the SPE. It accounts for such factors as the size of the local store, the number of registers, the properties of double-precision operations, the balance between pipelines, etc., and allows for selecting "the best" sizes of the kernel and micro-kernel operations.

The model-driven adaptation is followed by a series of systematic optimization steps. They include loop unrolling, double buffering at the register and memory levels, and the use of the NUMA library.

The proposed adaptation and optimization steps are fully implemented in the C language, without optimizing the code manually. For the IBM QS21 system, which uses two first-generation Cell processors, this implementation achieves 27.24 Gflop/s, which is 93.1% of the peak performance. This result is obtained for matrices of size 4096 by 4096. For the IBM QS22 system, based on PowerXCell 8i processors, double-precision performance is much higher: 184.4 Gflop/s is achieved, which is 90.0% of the peak performance. This result is reported for matrix multiplication of size 15,872 by 15,872. The overall performance could be slightly improved by replacing the macro-kernel developed in this work with the highly optimized Cell BLAS dgemm_64x64 kernel.
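The four-level hierarchy described in the abstract can be illustrated with nested blocking in portable C. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `dgemm_hier`, `macro_kernel`, `kernel` and `micro_kernel`, and the block sizes `MACRO`, `KERNEL` and `MICRO`, are hypothetical and chosen for readability, whereas the paper selects its sizes from the two performance models. The sketch runs sequentially, uses no SPE intrinsics or DMA transfers, and only mimics the local store and register file through the order in which blocks are visited.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes for illustration; not the values derived in the paper. */
enum { WORKERS = 16, MACRO = 64, KERNEL = 16, MICRO = 4 };

/* Level 4 (micro-kernel): C_tile += A_tile * B_tile on MICRO x MICRO tiles.
   The small accumulator array stands in for the SPE register file. */
static void micro_kernel(size_t n, const double *a, const double *b, double *c)
{
    double acc[MICRO][MICRO];
    for (size_t i = 0; i < MICRO; i++)
        for (size_t j = 0; j < MICRO; j++)
            acc[i][j] = c[i * n + j];
    for (size_t k = 0; k < MICRO; k++)
        for (size_t i = 0; i < MICRO; i++)
            for (size_t j = 0; j < MICRO; j++)
                acc[i][j] += a[i * n + k] * b[k * n + j];
    for (size_t i = 0; i < MICRO; i++)
        for (size_t j = 0; j < MICRO; j++)
            c[i * n + j] = acc[i][j];
}

/* Level 3 (kernel): a sequence of micro-kernel operations on one block. */
static void kernel(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < KERNEL; i += MICRO)
        for (size_t j = 0; j < KERNEL; j += MICRO)
            for (size_t k = 0; k < KERNEL; k += MICRO)
                micro_kernel(n, a + i * n + k, b + k * n + j, c + i * n + j);
}

/* Level 2 (macro-kernel): on the Cell this level would also move blocks
   between main memory and the local store; here it only iterates. */
static void macro_kernel(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < MACRO; i += KERNEL)
        for (size_t j = 0; j < MACRO; j += KERNEL)
            for (size_t k = 0; k < MACRO; k += KERNEL)
                kernel(n, a + i * n + k, b + k * n + j, c + i * n + j);
}

/* Level 1: C += A * B for row-major n x n matrices, n a multiple of MACRO.
   Block rows of C are assigned round-robin to WORKERS notional cores;
   the loop over w is sequential in this sketch. */
void dgemm_hier(size_t n, const double *a, const double *b, double *c)
{
    assert(n % MACRO == 0);
    size_t blocks = n / MACRO;
    for (size_t w = 0; w < WORKERS; w++)
        for (size_t bi = w; bi < blocks; bi += WORKERS)
            for (size_t bj = 0; bj < blocks; bj++)
                for (size_t bk = 0; bk < blocks; bk++)
                    macro_kernel(n,
                                 a + (bi * n + bk) * MACRO,
                                 b + (bk * n + bj) * MACRO,
                                 c + (bi * n + bj) * MACRO);
}
```

Calling `dgemm_hier(n, a, b, c)` with `c` zero-initialized yields the product of `a` and `b`. Each worker owns whole block rows of C, so a parallel version of the outer loop would need no synchronization on the output matrix; that is a design choice of this sketch and not necessarily the distribution used in the paper.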


Bibliographic record

Source
Parallel Computing, 2012, No. 5, pp. 260-276 (17 pages)
Authors
Roman Wyrzykowski; Krzysztof Rojek; Lukasz Szustak
Author affiliations
Institute of Computer and Information Sciences, Czestochowa University of Technology, Poland (all three authors)
Indexed in
Science Citation Index (SCI); Engineering Index (EI)
Original format
PDF
Language
English
Keywords
double-precision matrix multiplication; multicore architectures; Cell processor; parallel algorithms; adaptation; performance models

