Conference: Real-time systems symposium

A high-performance SIMD floating point unit for BlueGene/L: architecture, compilation, and algorithm design

Abstract

We describe the design, implementation, and evaluation of a dual-issue SIMD-like extension of the PowerPC 440 floating-point unit (FPU) core. This extended FPU targets both IBM's massively parallel BlueGene/L machine and more pervasive embedded platforms. It has several novel features, such as a computational crossbar and cross-load/store instructions, which enhance the performance of numerical codes. We further discuss the hardware-software co-design that was essential to fully realize the performance benefits of the FPU under the constraints imposed by the memory hierarchy on a BlueGene/L node: limited memory bandwidth and high penalties for misaligned data access. We describe several novel compiler and algorithmic techniques that take advantage of this architecture. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Preliminary performance data show that the algorithm-compiler-hardware combination delivers a significant fraction of peak floating-point performance for compute-bound kernels such as matrix multiplication, and a significant fraction of peak memory bandwidth for memory-bound kernels such as daxpy, while being largely insensitive to data alignment.
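For readers unfamiliar with the daxpy kernel named in the abstract, the following is a minimal sketch in portable C, not the authors' code. It computes y[i] += a * x[i] twice: once element by element, and once two elements per iteration, which only mimics in scalar code the shape of a two-way SIMD loop. The function names `daxpy_scalar` and `daxpy_paired` are hypothetical, and no BlueGene/L intrinsics or alignment handling are used.

```c
#include <stddef.h>

/* Reference daxpy: y[i] += a * x[i] for every i in [0, n). */
void daxpy_scalar(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}

/* Illustrative paired variant (an assumption, not the paper's kernel):
 * handles two adjacent elements per iteration, the way a two-way SIMD
 * unit would operate on a pair of doubles, then finishes any leftover
 * element when n is odd. */
void daxpy_paired(size_t n, double a, const double *x, double *y) {
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        /* Load one pair from each array. */
        double x0 = x[i], x1 = x[i + 1];
        double y0 = y[i], y1 = y[i + 1];
        /* One multiply-add per lane. */
        y[i]     = y0 + a * x0;
        y[i + 1] = y1 + a * x1;
    }
    if (i < n) {
        /* Scalar remainder for an odd element count. */
        y[i] += a * x[i];
    }
}
```

Each element costs two floating-point operations (one multiply, one add) against two 8-byte loads and one 8-byte store, which is why daxpy is limited by memory bandwidth rather than by arithmetic throughput.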