Optimized Unrolling of Nested Loops

Abstract

In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include a) a more detailed cost model that includes ILP and I-cache considerations, b) a new code generation algorithm for unrolling nested loops that generates more compact code (with fewer remainder loops) than the unroll-and-jam transformation, and c) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2X speedup on matrix multiply, and an average 1.08X speedup on seven of the SPEC95fp benchmarks (with a 1.2X speedup for two benchmarks). These speedups are significant because the baseline compiler used for comparison is the IBM XL Fortran product compiler, which generates high-quality code with unrolling and software pipelining of innermost loops enabled. Larger performance improvements due to unrolling of nested loops can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).
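To make the terminology concrete, here is a minimal sketch of the classic unroll-and-jam transformation that the abstract uses as its point of comparison, applied to matrix multiply with the unroll vector (2, 2, 1) for the (i, j, k) loops. It is not the paper's new code generation algorithm, whose details the abstract does not give. The function names and the choice of unroll factors are assumptions made for illustration, and Python is used only to show the loop structure, not to demonstrate a speedup.

```python
def matmul_reference(a, b, n):
    """Plain triple loop: c[i][j] += a[i][k] * b[k][j]."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unroll_and_jam(a, b, n):
    """Same computation with i and j each unrolled by 2 and jammed into k.

    Hypothetical illustration of unroll vector (2, 2, 1); the factors are
    hard-coded rather than chosen by a cost model.
    """
    c = [[0.0] * n for _ in range(n)]
    n2 = n - n % 2  # largest even bound; leftover iterations go to remainder loops

    for i in range(0, n2, 2):
        for j in range(0, n2, 2):
            # Four accumulators held in scalars (registers in compiled code).
            c00 = c01 = c10 = c11 = 0.0
            for k in range(n):
                # Each loaded value is reused twice: 4 loads feed 4 multiply-adds.
                a0, a1 = a[i][k], a[i + 1][k]
                b0, b1 = b[k][j], b[k][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j], c[i][j + 1] = c00, c01
            c[i + 1][j], c[i + 1][j + 1] = c10, c11
        # Remainder loop for j: the leftover column when n is odd.
        for j in range(n2, n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                c[i + 1][j] += a[i + 1][k] * b[k][j]
    # Remainder loop for i: the leftover row when n is odd.
    for i in range(n2, n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c
```

Even in this small case, two unrolled loops already require two remainder loops in addition to the main unrolled body. Remainder loops of this kind are what contribution b) of the abstract aims to reduce, and the four scalar accumulators show why the number of available registers limits which unroll vectors are feasible.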
