首页> 外文OA文献 >A unified approach to instruction scheduling and register allocation on clustered VLIW processors
【2h】

A unified approach to instruction scheduling and register allocation on clustered VLIW processors

机译:集群VLIW处理器上用于指令调度和寄存器分配的统一方法

代理获取
本网站仅为用户提供外文OA文献查询和代理获取服务,本网站没有原文。下单后我们将采用程序或人工为您竭诚获取高质量的原文,但由于OA文献来源多样且变更频繁,仍可能出现获取不到、文献不完整或与标题不符等情况,如果获取不到我们将提供退款服务。请知悉。

摘要

VLIW (Very Long Instruction Word) processors issue and execute multiple operations in parallel, on different functional units at each processor cycle. A major problem with VLIW processors is that a single register file hampers the scalability of the processor. Clustering is an efficient technique for improving the scalability and energy consumption of VLIW processors. In a clustered VLIW processor, each cluster has its own functional units and local register file with fewer registers and ports. Clusters are connected by an inter-cluster communication network. An optimising compiler plays a key role in improving the ILP (Instruction Level Parallelism) for clustered VLIW processors. Instruction scheduling and register allocation are two important parts in an optimising compiler for clustered VLIW processors. These two parts are closely related and have a significant impact on the ILP. This thesis studies instruction scheduling and register allocation on clustered VLIW processors, and presents a set of efficient techniques for these problems.Firstly, we study the problem of scheduling a set of operations of a basic block on a clustered VLIW processor where for any two functional units, the two subsets of operations executed on these two functional units are either the same or disjoint. The objective is to minimise the schedule length for all the operations in the basic block. We propose a novel priority-based, phase coupled heuristic. When computing the operation priorities, our heuristic considers not only the inter-operation latencies, but also the processor resource constraints. After computing the priorities, our heuristic schedules each operation on a functional unit of a cluster by considering different cases. Simulation results show that our heuristic significantly improves over two previous heuristics, UAS and Integrated, for the basic blocks taken from the MediaBench II benchmark suite based on six processors.Secondly, we study the problem of scheduling a set of operations of a basic block on a clustered VLIW processor where each functional unit can execute an arbitrary subset of operations. We extend the previous scheduling heuristic by using the maximum bipartite matching to find the latest start time and the earliest start time of an operation. Simulation results show that our extended heuristic significantly improves over UAS and Integrated for the basic blocks taken from the MediaBench II benchmark suite based on three processors.Thirdly, we study the problem of integrated instruction scheduling and register allocation. We propose a unified approach which integrates instruction scheduling, cluster assignment, and register allocation into a single phase. Our unified approach aims at minimising the overall schedule length of each basic block in a program, as well as balancing the register pressure on each cluster. It consists of an instruction scheduler and an incremental register allocator. The instruction scheduler schedules all the basic blocks of a program in reverse postorder, and all the operations of each basic block based on their priorities. The operation priorities are dynamically updated to reduce the register pressure during the instruction scheduling. When scheduling an operation, the instruction scheduler allocates physical registers to virtual registers of the operation by using the incremental register allocator. When performing cluster assignment for an operation,the instruction scheduler considers both the start time of the operation and the register pressure on each cluster, leading to a balanced usage of registers on all the clusters. Three different register allocation strategies and a register spill strategy have been developed. The simulation results show that our approach outperforms a previous approach, CARS, for three processors, in terms of the average schedule lengths of the basic blocks taken from the MediaBench II benchmark suite.Lastly, we extend our unified approach to instruction scheduling and register allocation by considering lifetime holes. A lifetime hole is an interval in which a variable does not contain a valid value. We propose efficient techniques for utilising lifetime holes on the-fly during register allocation and incorporate these techniques into our unified approach to instruction scheduling and register allocation. Furthermore, we present our simulation results showing that the unified approach considering lifetime holes performs better than the unified approach without considering lifetime holes.
机译:VLIW(超长指令字)处理器在每个处理器周期在不同的功能单元上并行发出和执行多个操作。 VLIW处理器的一个主要问题是单个寄存器文件妨碍了处理器的可伸缩性。群集是提高VLIW处理器的可伸缩性和能耗的有效技术。在群集的VLIW处理器中,每个群集都有其自己的功能单元和具有较少寄存器和端口的本地寄存器文件。群集通过群集间通信网络连接。优化的编译器在改善群集VLIW处理器的ILP(指令级并行性)方面起着关键作用。指令调度和寄存器分配是针对群集VLIW处理器的优化编译器中的两个重要部分。这两个部分紧密相关,并且对ILP具有重大影响。本文研究了集群VLIW处理器上的指令调度和寄存器分配,并提出了解决这些问题的有效技术。首先,我们研究了在集群VLIW处理器上调度一组基本块操作的问题,其中任意两个功能单元,在这两个功能单元上执行的操作的两个子集相同或不相交。目的是使基本块中所有操作的调度时间最小化。我们提出了一种新颖的基于优先级的相位耦合启发式算法。在计算操作优先级时,我们的启发式方法不仅考虑了操作间的延迟,还考虑了处理器资源的限制。在计算了优先级之后,我们的启发式方法通过考虑不同的情况来调度集群功能单元上的每个操作。仿真结果表明,对于基于六个处理器的MediaBench II基准测试套件中的基本块,我们的启发式算法大大改进了之前的两个启发式算法UAS和Integrated。集群式VLIW处理器,其中每个功能单元可以执行任意操作子集。我们通过使用最大二分匹配来找到操作的最新开始时间和最早开始时间来扩展先前的调度启发式方法。仿真结果表明,针对基于三个处理器的MediaBench II基准测试套件中的基本块,我们的扩展启发式算法相对于UAS和Integrated有了显着改进。第三,研究了集成指令调度和寄存器分配问题。我们提出了一种统一的方法,该方法将指令调度,集群分配和寄存器分配集成到一个阶段。我们的统一方法旨在最大程度地减少程序中每个基本块的总调度时间,并平衡每个群集上的寄存器压力。它由一个指令调度器和一个增量寄存器分配器组成。指令调度器以相反的后顺序调度程序的所有基本块,并根据优先级调度每个基本块的所有操作。动态地更新操作优先级,以减少指令调度期间的寄存器压力。在调度操作时,指令调度器通过使用增量寄存器分配器将物理寄存器分配给操作的虚拟寄存器。在为操作执行群集分配时,指令调度程序会同时考虑操作的开始时间和每个群集上的寄存器压力,从而平衡所有群集上的寄存器。已经开发了三种不同的寄存器分配策略和寄存器溢出策略。仿真结果表明,就MediaBench II基准测试套件中获取的基本块的平均调度长度而言,我们的方法在三个处理器上的性能优于先前的方法CARS。最后,我们将统一方法扩展到指令调度和寄存器分配通过考虑终身漏洞。生存期漏洞是一个变量不包含有效值的时间间隔。我们提出了在寄存器分配期间动态利用生命周期漏洞的有效技术,并将这些技术整合到我们用于指令调度和寄存器分配的统一方法中。此外,我们给出的仿真结果表明,考虑寿命孔的统一方法比不考虑寿命孔的统一方法性能更好。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号