
Loop fusion for clustered VLIW architectures


Abstract

Embedded systems require maximum performance from a processor under significant constraints on power consumption and chip cost. Using software pipelining, high-performance digital signal processors can often exploit considerable instruction-level parallelism (ILP), and thus significantly improve performance. However, software pipelining, in some instances, hinders the goals of low power consumption and low chip cost. Specifically, the registers required by a software pipelined loop may exceed the size of the physical register set.

The register pressure problem incurred by software pipelining makes it difficult to build a high-performance embedded processor with a single, multi-ported register bank with enough registers to support high levels of ILP while maintaining clock speed and limiting power consumption. The large number of ports required to support a single register bank severely hampers access time. The port requirement for a register bank can be reduced in hardware by partitioning the register bank into multiple banks, each connected to a disjoint subset of the functional units, called a cluster. Since a functional unit is not directly connected to all register banks, energy and resources can be wasted because of delays incurred when accessing "non-local" registers.

The overhead due to partitioning of the register set can be ameliorated by using high-level compiler loop optimization techniques such as unrolling, unroll-and-jam and fusion. High-level loop optimizations spread data-independent parallelism across clusters; this parallelism may not require "non-local" register accesses, and it can provide work to hide the latency of any such register accesses that are needed.

In this paper, we examine the effects of loop fusion on DSP loops run on four simulated, clustered VLIW architectures and the Texas Instruments TMS320C64x. Our experiments show a harmonic mean speedup of 1.3 to 2.
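As a rough illustration of the transformation the abstract refers to, the C sketch below shows two data-independent loops before and after fusion. It is not taken from the paper: the function names, the array names and the constants are hypothetical, and the `restrict` qualifiers encode the assumption that the arrays do not overlap, which is what makes the fusion legal here.

```c
#include <stddef.h>

/* Before fusion: two separate loops over the same index range.
 * Each loop on its own exposes only one independent operation
 * per iteration to the instruction scheduler. */
void scale_and_offset_unfused(const int *restrict a, const int *restrict b,
                              int *restrict x, int *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = a[i] * 3;
    for (size_t i = 0; i < n; i++)
        y[i] = b[i] + 7;
}

/* After fusion: one loop whose body holds both statements.
 * The two statements share no data, so a clustered VLIW compiler
 * could (hypothetically) place each on a different cluster without
 * any "non-local" register access between them. */
void scale_and_offset_fused(const int *restrict a, const int *restrict b,
                            int *restrict x, int *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = a[i] * 3;   /* independent of the statement below */
        y[i] = b[i] + 7;   /* independent of the statement above */
    }
}
```

Both versions compute the same results. The fused loop simply gives the scheduler a larger body containing independent work, which is the property the abstract attributes to high-level loop optimizations.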
