Selective Vectorization for Short-Vector Instructions

Abstract

Multimedia extensions are nearly ubiquitous in today's general-purpose processors. These extensions consist primarily of a set of short-vector instructions that apply the same opcode to a vector of operands. Vector instructions introduce a data-parallel component to processors that exploit instruction-level parallelism, and present an opportunity for increased performance. In fact, ignoring a processor's vector opcodes can leave a significant portion of the available resources unused. In order for software developers to find short-vector instructions generally useful, however, the compiler must target these extensions with complete transparency and consistent performance. This paper describes selective vectorization, a technique for balancing computation across a processor's scalar and vector units. Current approaches for targeting short-vector instructions directly adopt vectorizing technology first developed for supercomputers. Traditional vectorization, however, can lead to a performance degradation since it fails to account for a processor's scalar resources. We formulate selective vectorization in the context of software pipelining. Our approach creates software pipelines with shorter initiation intervals, and therefore, higher performance. A key aspect of selective vectorization is its ability to manage transfer of operands between vector and scalar instructions. Even when operand transfer is expensive, our technique is sufficiently sophisticated to achieve significant performance gains. We evaluate selective vectorization on a set of SPEC FP benchmarks. On a realistic VLIW processor model, the approach achieves whole-program speedups of up to 1.35x over existing approaches. For individual loops, it provides speedups of up to 1.75x.
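The balancing idea in the abstract can be illustrated with a toy resource model. The sketch below is not the paper's algorithm: it brute-forces every scalar/vector assignment for a small loop body and picks the one with the lowest resource-constrained initiation interval. The function names, the machine parameters (`scalar_units`, `vector_units`, `vl`), and the rule that each dependence crossing the partition costs `transfer_cost` extra scalar-unit slots are all assumptions made for illustration.

```python
from itertools import product
from math import ceil


def res_mii(assignment, *, edges, vl, scalar_units, vector_units, transfer_cost):
    """Resource-bound initiation interval for `vl` original loop iterations.

    assignment[i] is True if operation i is vectorized, False if it stays scalar.
    Hypothetical cost model: a vectorized op issues 1 vector instruction, a
    scalar op issues `vl` scalar instructions, and every dependence edge that
    crosses the scalar/vector boundary adds `transfer_cost` scalar-unit slots.
    """
    n_vec = sum(assignment)
    n_scalar = (len(assignment) - n_vec) * vl
    crossings = sum(1 for a, b in edges if assignment[a] != assignment[b])
    n_scalar += crossings * transfer_cost
    return max(ceil(n_scalar / scalar_units), ceil(n_vec / vector_units))


def best_partition(n_ops, **machine):
    """Exhaustively search all 2**n_ops assignments (only viable for tiny loops)."""
    return min(product((False, True), repeat=n_ops),
               key=lambda a: res_mii(a, **machine))


# Toy loop: six vectorizable operations forming a dependence chain 0 -> 1 -> ... -> 5.
machine = dict(edges=[(i, i + 1) for i in range(5)],
               vl=2, scalar_units=3, vector_units=1, transfer_cost=1)

all_scalar = res_mii((False,) * 6, **machine)   # 12 scalar slots on 3 units
all_vector = res_mii((True,) * 6, **machine)    # 6 vector slots on 1 unit
selective = res_mii(best_partition(6, **machine), **machine)

print(all_scalar, all_vector, selective)  # prints "4 6 3"
```

In this made-up configuration, vectorizing everything is slower than leaving the loop scalar because the single vector unit becomes the bottleneck, while a mixed assignment keeps both kinds of unit busy and reaches a shorter interval even after paying for one operand transfer. Exhaustive search is used here only for clarity, since it grows exponentially with the number of operations.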