Register pressure guided loop optimization.

Abstract

Digital signal processing (DSP) processors are processors designed for processing digital signals, and they are used in a very broad range of fields. However, carelessly designed loop optimizations in an optimizing compiler for DSP processors cannot always deliver a performance gain. Reasons include causing too much register pressure or adding too much communication between register files to transfer values, called inter-cluster communication.

To control register pressure, predicting the register requirement before applying a loop optimization can effectively prevent performance degradation. In this dissertation, we focus on two essential loop optimizations: scalar replacement and unroll-and-jam. We present two low-cost register prediction methods that operate on loops in a high-level representation before the optimizations are applied, taking other loop optimizations and general scalar optimizations into account. For unroll-and-jam, we also describe a performance model that uses the prediction results to determine the unroll vector automatically from a given unroll space in order to achieve the best run-time performance.

Our prediction algorithm for scalar replacement predicts the floating-point register pressure of a loop within 2 registers and the integer register pressure within 2.7 registers on average, with a time complexity of O(n^2) in practice, where n is the number of nodes in the data dependence graph used. This algorithm achieves performance similar to the best previous approach, which has O(n^3) complexity. For the unroll-and-jam prediction algorithm, our experiments show that it predicts the floating-point register pressure within 3 registers and the integer register pressure within 4 registers. With this algorithm, for 92% of the test loops in our test suite, the performance model picks unroll vectors that achieve the best loop performance or performance close to the best. In addition, for the Polyhedron benchmark, our register pressure guided unroll-and-jam improves overall performance by about 2% over the model in the industry-leading optimizing Open64 backend, in both 32-bit and 64-bit modes on x86 and x86-64 architectures.

For inter-cluster communication, this dissertation presents a fusion algorithm for clustered VLIW architectures that considers the impact of unroll-and-jam, scalar replacement, and other optimizations in order to provide the best overall performance with the minimum additional inter-cluster communication. In the experiments, this fusion algorithm applied with unroll-and-jam and scalar replacement speeds up the test loops by average factors ranging from 1.57 to 1.69, compared with the results of similar optimizations without fusion.

With the register pressure prediction algorithms and the demonstration of register pressure guided loop optimization, our research opens the door to completely eliminating the performance degradation of loop optimizations due to register pressure in the future. Loop fusion that considers unroll-and-jam also helps a compiler achieve better performance on a clustered VLIW architecture with a partitioned register bank.
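As an illustration of the two loop optimizations named in the abstract, the following is a minimal sketch and not code from the dissertation. It applies unroll-and-jam (outer loop unrolled by 2, inner loop bodies jammed together) and scalar replacement (array references held in scalars) by hand to a matrix-vector product. The function names and the unroll factor of 2 are assumptions chosen for the example. The transformed loop keeps three scalars live at once (`t0`, `t1`, `xj`) where the baseline needs one running value, which is the kind of growth in register demand that a register pressure prediction would have to account for.

```python
def matvec_baseline(a, x):
    """Plain loop nest: y[i] = sum over j of a[i][j] * x[j]."""
    n, m = len(a), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            # y[i] and x[j] are re-read from memory on every iteration.
            y[i] = y[i] + a[i][j] * x[j]
    return y


def matvec_unroll_and_jam(a, x):
    """Same computation after hand-applied unroll-and-jam by 2 on the
    outer loop, followed by scalar replacement (illustrative only)."""
    n, m = len(a), len(x)
    y = [0.0] * n
    i = 0
    while i + 1 < n:
        # Scalar replacement: y[i] and y[i + 1] live in t0 and t1.
        t0 = 0.0
        t1 = 0.0
        for j in range(m):
            xj = x[j]  # loaded once, reused by both jammed bodies
            t0 = t0 + a[i][j] * xj
            t1 = t1 + a[i + 1][j] * xj
        y[i] = t0
        y[i + 1] = t1
        i += 2
    if i < n:
        # Cleanup iteration when the row count is odd.
        t0 = 0.0
        for j in range(m):
            t0 = t0 + a[i][j] * x[j]
        y[i] = t0
    return y
```

A larger unroll factor would reuse each `xj` more often but would also add one more live accumulator per unrolled row, so the benefit stops once the accumulators no longer fit in the available registers.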

Bibliographic record

  • Author

    Ma, Yin

  • Affiliation

    Michigan Technological University

  • Degree-granting institution Michigan Technological University
  • Discipline Computer science
  • Degree Ph.D.
  • Year 2007
  • Pages 164 p.
  • Total pages 164
  • Original format PDF
  • Language of text English
