Journal: Microprocessors and Microsystems

Streamlining Long Latency Instructions For Seamlessly Combined Out-of-order And In-order Execution



Abstract

In present-day wide-issue processors, the size of the instruction scheduling window, also called the issue queue (IQ), is limited mainly by the hardware complexity of its logic, which in turn limits the number of instructions scanned every cycle to extract instruction-level parallelism (ILP). To make matters worse, instructions that depend on long-latency load operations remain in the IQ until their source operands are ready. Such delayed instructions block new instructions from entering the IQ, even when the new instructions are potentially ready for execution. The growing disparity between processor and memory speeds further lengthens the time these instructions occupy the IQ. To alleviate the problem, this paper proposes a novel technique that streamlines instructions into separate buffers according to their chain of dependencies. Each instruction is streamlined behind a parent instruction while it waits for its source operand to be supplied by a long-latency memory operation. These instructions are segregated from the IQ, which relieves pressure on the IQ and lets potentially executable instructions flow through the pipeline.

Our analysis of SPEC2000 programs reveals that instructions that depend on load cache misses, or on dependents of those misses, typically have their first source operand ready within 5-15% of their total wait time in the IQ. Based on this observation, the long-latency memory-dependent instructions are streamlined into in-order buffers when their first operand is ready. In the proposed architecture, instructions from both the conventional IQ and the heads of the streamline buffers can be selected for execution, while the wakeup logic complexity remains the same as in the conventional design. Our results show that a 32-entry IQ supplemented by 32 in-order buffers achieves a performance speedup of 15.7% for FP benchmarks and 2% for integer benchmarks, which is comparable to that of a conventional 64-entry IQ. A 64-entry IQ design can gain performance over a 32-entry IQ, but only with a large overhead in the circuit delay complexity of the wakeup logic, whereas streamline buffers gain performance over a 32-entry IQ without any such overhead.
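The mechanism described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' design: the class `ToyScheduler`, the `miss_dependent` flag, the rule "move an instruction once exactly one source is still outstanding", the buffer allocation policy (append behind the parent if it is a buffer tail, otherwise take an empty buffer), and the selection order are all hypothetical choices made for illustration. The abstract specifies none of them.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    srcs: tuple = ()              # names of producer instructions
    miss_dependent: bool = False  # assumed flag: on a chain rooted at a load miss


class ToyScheduler:
    """Toy model: an out-of-order IQ plus in-order streamline buffers."""

    def __init__(self, iq_size, num_buffers):
        self.iq_size = iq_size
        self.iq = []                                          # every entry is scanned
        self.buffers = [deque() for _ in range(num_buffers)]  # only heads are scanned
        self.done = set()                                     # completed producers

    def pending(self, ins):
        """Sources of `ins` that have not completed yet."""
        return [s for s in ins.srcs if s not in self.done]

    def dispatch(self, ins):
        """Try to place `ins` in the IQ; return False if the IQ is full."""
        if len(self.iq) >= self.iq_size:
            return False
        self.iq.append(ins)
        return True

    def streamline(self):
        """Move miss-dependent instructions with one outstanding source out of the IQ."""
        for ins in list(self.iq):
            waiting = self.pending(ins)
            if not (ins.miss_dependent and len(waiting) == 1):
                continue
            buf = self._pick_buffer(waiting[0])
            if buf is not None:
                self.iq.remove(ins)
                buf.append(ins)

    def _pick_buffer(self, parent):
        """Assumed policy: queue behind the parent, else start an empty buffer."""
        for buf in self.buffers:
            if buf and buf[-1].name == parent:
                return buf
        for buf in self.buffers:
            if not buf:
                return buf
        return None

    def select(self, width):
        """Issue up to `width` ready instructions from the IQ and the buffer heads."""
        issued = []
        for ins in list(self.iq):
            if len(issued) < width and not self.pending(ins):
                self.iq.remove(ins)
                issued.append(ins)
        for buf in self.buffers:
            if len(issued) < width and buf and not self.pending(buf[0]):
                issued.append(buf.popleft())
        return issued

    def complete(self, name):
        """Mark a producer (for example a returning load) as completed."""
        self.done.add(name)


# A two-entry IQ clogged by a chain that waits on a load miss.
sched = ToyScheduler(iq_size=2, num_buffers=2)
sched.dispatch(Instr("add1", srcs=("load_miss",), miss_dependent=True))
sched.dispatch(Instr("mul1", srcs=("add1",), miss_dependent=True))
blocked = sched.dispatch(Instr("sub_indep"))    # IQ is full
sched.streamline()                              # chain moves into one in-order buffer
admitted = sched.dispatch(Instr("sub_indep"))   # IQ now has room
issued_names = [i.name for i in sched.select(width=2)]
```

In this sketch the dependent chain ends up in a single buffer in program order, so only its head needs to be checked each cycle, while the independent instruction enters the freed IQ slot and issues ahead of the stalled chain.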
