Loop fusion for clustered VLIW architectures

Abstract

Embedded systems require maximum performance from a processor within significant constraints on power consumption and chip cost. Using software pipelining, high-performance digital signal processors can often exploit considerable instruction-level parallelism (ILP), and thus significantly improve performance. However, software pipelining, in some instances, hinders the goals of low power consumption and low chip cost. Specifically, the registers required by a software pipelined loop may exceed the size of the physical register set.

The register pressure problem incurred by software pipelining makes it difficult to build a high-performance embedded processor with a single, multi-ported register bank with enough registers to support high levels of ILP while maintaining clock speed and limiting power consumption. The large number of ports required to support a single register bank severely hampers access time. The port requirement for a register bank can be reduced via hardware by partitioning the register bank into multiple banks connected to disjoint subsets of functional units, called clusters. Since a functional unit is not directly connected to all register banks, wasted energy and resources can result due to delays incurred when accessing "non-local" registers.

The overhead due to partitioning of the register set can be ameliorated by using high-level compiler loop optimization techniques such as unrolling, unroll-and-jam and fusion. High-level loop optimizations spread data-independent parallelism across clusters that may not require "non-local" register accesses and can provide work to hide the latency of any such register accesses that are needed.

In this paper, we examine the effects of loop fusion on DSP loops run on four simulated, clustered VLIW architectures and the Texas Instruments TMS320C64x. Our experiments show a harmonic mean speedup of 1.3 to 2.
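
As a rough illustration of the transformation the abstract refers to, the C sketch below shows two adjacent loops over the same index range and a fused version of them. It is not taken from the paper: the function names, the kernel, and the self-check helper are hypothetical, and the `restrict` qualifiers encode the assumption that the arrays do not overlap, which is what makes fusing these loops legal. The two statements in the fused body do not depend on each other, so a compiler targeting a clustered machine could, in principle, schedule them on different clusters; whether it does so depends on the compiler and the architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused form: two separate loops over the same index range.
 * Each loop exposes only its own statement to the scheduler. */
void scale_and_offset_unfused(const int *restrict x, int *restrict y,
                              int *restrict z, size_t n, int a, int b)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i];
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] + b;
}

/* Fused form: one loop whose body holds both statements.
 * The statements are data-independent (assuming no aliasing),
 * so the loop body carries more parallel work per iteration. */
void scale_and_offset_fused(const int *restrict x, int *restrict y,
                            int *restrict z, size_t n, int a, int b)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i];
        z[i] = x[i] + b;
    }
}

/* Hypothetical self-check: both forms must produce identical results. */
void check_fusion_equivalence(void)
{
    enum { N = 8 };
    int x[N], y1[N], z1[N], y2[N], z2[N];

    for (size_t i = 0; i < N; i++)
        x[i] = (int)i - 3;

    scale_and_offset_unfused(x, y1, z1, N, 5, 7);
    scale_and_offset_fused(x, y2, z2, N, 5, 7);

    for (size_t i = 0; i < N; i++) {
        assert(y1[i] == y2[i]);
        assert(z1[i] == z2[i]);
    }
}
```

Calling `check_fusion_equivalence()` from a test driver confirms that the fused loop computes the same values as the two original loops; it says nothing about speed, which has to be measured on the target processor.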
