Home > Foreign-language conferences > 1998 International Conference on Supercomputing > A Performance Study of Out-of-order Vector Architectures and Short Registers

A Performance Study of Out-of-order Vector Architectures and Short Registers

Abstract

This paper presents a study of the impact of reducing the vector register length in an out-of-order vector architecture. In traditional in-order vector architectures, long vector registers have typically been the norm. We start by presenting data showing that, even for highly vectorizable codes, only a small fraction of the elements of a long vector register are actually used. We also show that reducing the register size in a traditional vector architecture, in an attempt to reduce hardware cost and maximize register utilization, results in severe performance degradation.

However, when we combine out-of-order execution and short registers, our simulations show that the performance penalty can be made very small. Moreover, this new architecture tolerates memory latency much better than a traditional machine and uses the storage space in each register more efficiently. We present results for a selection of the Specfp 92 and Perfect Club codes that show speedups of the out-of-order machine over the traditional machine in the range 1.1 to 1.6. Halving the register size (from 16Kb in the out-of-order machine down to 8Kb) yields speedups around 1.3 and as high as 1.6. Even when the register length is reduced to 1/4 the original size, speedups are still around 1.2, and with a register length of 16 elements (1/8 the original) most programs perform very close to the traditional in-order vector machine.
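The register-utilization claim can be illustrated with a small sketch. This is not the paper's simulator or methodology; it only applies standard strip-mining arithmetic, in which a loop is split into strips of at most one register's worth of elements. The 128-element original length is inferred from the abstract's statement that 16 elements is 1/8 the original, and the trip count of 40 is a hypothetical value chosen for illustration.

```python
def strip_mine(trip_count: int, max_vl: int) -> list[int]:
    """Split a loop of trip_count iterations into strips of at most max_vl elements."""
    full_strips, remainder = divmod(trip_count, max_vl)
    return [max_vl] * full_strips + ([remainder] if remainder else [])


def register_utilization(trip_count: int, max_vl: int) -> float:
    """Fraction of register elements actually filled across all strips."""
    strips = strip_mine(trip_count, max_vl)
    if not strips:
        return 0.0
    return trip_count / (len(strips) * max_vl)


# Inferred from the abstract: 16 elements is 1/8 of the original length.
ORIGINAL_LENGTH = 16 * 8
# Hypothetical loop trip count, not taken from the paper.
TRIP_COUNT = 40

for length in (ORIGINAL_LENGTH, ORIGINAL_LENGTH // 2, ORIGINAL_LENGTH // 4, 16):
    strips = strip_mine(TRIP_COUNT, length)
    utilization = register_utilization(TRIP_COUNT, length)
    print(f"length={length:3d}  strips={len(strips)}  utilization={utilization:.2f}")
```

Under these assumptions, shorter registers fill a larger fraction of their elements but need more vector instructions per loop, which is the trade-off the abstract describes.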

Bibliographic record

  • Source
  • Conference location: Melbourne (AU)
  • Author affiliations

    Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya-Barcelona, Spain;

    Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya-Barcelona, Spain;

    Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya-Barcelona, Spain;

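  • Conference organizer
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Computer applications
  • Keywords

  • Date indexed: 2022-08-26 14:03:09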
