A high-performance SIMD floating point unit for BlueGene/L: architecture, compilation, and algorithm design

Abstract

We describe the design, implementation, and evaluation of a dual-issue SIMD-like extension of the PowerPC 440 floating-point unit (FPU) core. This extended FPU targets both IBM's massively parallel BlueGene/L machine and more pervasive embedded platforms. It has several novel features, such as a computational crossbar and cross-load/store instructions, which enhance the performance of numerical codes. We further discuss the hardware-software co-design that was essential to fully realize the performance benefits of the FPU under the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a BlueGene/L node. We describe several novel compiler and algorithmic techniques that take advantage of this architecture. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Preliminary performance data show that the algorithm-compiler-hardware combination delivers a significant fraction of peak floating-point performance for compute-bound kernels such as matrix multiplication, and a significant fraction of peak memory bandwidth for memory-bound kernels such as daxpy, while being largely insensitive to data alignment.
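For readers unfamiliar with daxpy, the memory-bound kernel named in the abstract, the following is a minimal sketch in portable C. It is not code from the paper and uses no BlueGene/L instructions or intrinsics; the function names `daxpy_scalar` and `daxpy_paired` are hypothetical. The paired variant only illustrates, under that assumption, how a two-wide SIMD-like unit would consume two elements per step, with a scalar step for an odd trailing element.

```c
#include <assert.h>
#include <stddef.h>

/* Reference daxpy: y[i] = y[i] + a * x[i] for i in [0, n). */
static void daxpy_scalar(size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/*
 * Hypothetical paired variant: processes two adjacent elements per
 * iteration, the shape a two-wide SIMD-like FPU would operate on.
 * This is plain C, so the compiler decides what is actually emitted.
 */
static void daxpy_paired(size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        double y0 = y[i]     + a * x[i];      /* first element of pair  */
        double y1 = y[i + 1] + a * x[i + 1];  /* second element of pair */
        y[i]     = y0;
        y[i + 1] = y1;
    }
    if (i < n)                                /* odd trailing element */
        y[i] += a * x[i];
}
```

Each element needs two loads and one store for a single multiply-add, which is why the abstract classifies daxpy as limited by memory bandwidth rather than by floating-point throughput.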

Bibliographic record

Source: Real-time systems symposium | 1992 | pp. 85-96 | 12 pages
Conference location: Phoenix, AZ (US)
Authors: Bachega L.; Siddhartha Chatterjee; Dockser K.A.; Gunnels J.A.; Manish Gupta; Gustavson F.G.; Lapkowski C.A.; Liu G.K.; Mendell M.P.; Wait C.D.; Ward T.J.C.
Author affiliation: IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
Original format: PDF
Language: English (eng)
Chinese Library Classification: Information processing

