Streamlining Long Latency Instructions For Seamlessly Combined Out-of-order And In-order Execution

Hui Wang; Rama Sangireddy


Abstract

In current wide-issue processors, the size of the instruction scheduling window, also called the issue queue (IQ), is limited mainly by the hardware complexity of its logic, which in turn limits the number of instructions scanned every cycle to extract instruction-level parallelism (ILP). To make matters worse, instructions that depend on long-latency load operations continue to reside in the IQ until their source operands are ready. Such delayed instructions block new instructions from entering the IQ even if those new instructions are potentially ready for execution. The growing disparity between processor and memory speeds further aggravates the delay in dislodging instructions from the IQ. To alleviate the problem, in this paper we propose a novel technique that streamlines instructions into separate buffers according to their chain of dependencies. Each instruction is streamlined behind a parent instruction while it waits for a source operand to be supplied by a long-latency memory operation. These instructions are segregated from the IQ, which relieves the pressure on the IQ and enables the flow of potentially executable instructions in the pipeline.

Our analysis of SPEC2000 programs reveals that instructions dependent on load cache misses, or on their dependents, typically have their first source operand ready within 5-15% of their total wait time in the IQ. Based on this observation, the long-latency memory-dependent instructions are streamlined into in-order buffers when their first operand is ready. In the proposed architecture, instructions from both the conventional IQ and the heads of the streamline buffers can be selected for execution, while the wakeup logic complexity remains the same as in the conventional design. Our results show that the performance speedup of a 32-entry IQ supplemented by 32 in-order buffers is 15.7% for FP benchmarks and 2% for integer benchmarks, which is comparable to that of a conventional 64-entry IQ. A 64-entry IQ design can gain performance over a 32-entry IQ, albeit with a large overhead in the circuit delay complexity of the wakeup logic, whereas streamline buffers gain performance over a 32-entry IQ without any such overhead.
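The mechanism summarized in the abstract can be illustrated with a toy cycle-level model. The sketch below is not the authors' simulator: every name (`Instr`, `simulate`, `demo_program`, `miss_threshold`), the single-cycle latency of ordinary instructions, the buffer-allocation policy, and the selection order are assumptions made only for illustration. It assumes that an instruction leaves the IQ for an in-order buffer once its only outstanding operand comes from a load miss or from an already streamlined instruction, and that only buffer heads compete with IQ entries for issue.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str          # unique instruction name (hypothetical identifier)
    srcs: tuple = ()   # names of the producer instructions
    latency: int = 1   # cycles from issue to result; large for a load miss

def simulate(program, iq_size, num_buffers, width=2, miss_threshold=10):
    """Toy model: return the cycle in which the last result is written back."""
    pending = deque(program)                         # not yet dispatched, program order
    iq = []                                          # out-of-order issue queue
    buffers = [deque() for _ in range(num_buffers)]  # in-order streamline buffers
    in_flight = {}                                   # name -> completion cycle
    done = set()                                     # names whose results are available
    long_wait = set()                                # pending load misses and streamlined instrs
    cycle = 0
    while pending or iq or any(buffers) or in_flight:
        cycle += 1
        # 1. Write back results whose latency has elapsed.
        for name in [n for n, t in in_flight.items() if t <= cycle]:
            del in_flight[name]
            done.add(name)
            long_wait.discard(name)
        # 2. Streamline: move an IQ entry whose single outstanding operand
        #    belongs to a long-latency chain into an in-order buffer.
        for ins in list(iq):
            waiting = [s for s in ins.srcs if s not in done]
            if len(waiting) != 1 or waiting[0] not in long_wait:
                continue
            # Prefer the buffer whose tail is the parent, else any empty buffer.
            target = next((b for b in buffers if b and b[-1].name == waiting[0]), None)
            if target is None:
                target = next((b for b in buffers if not b), None)
            if target is not None:
                iq.remove(ins)
                target.append(ins)
                long_wait.add(ins.name)
        # 3. Select: buffer heads and IQ entries compete for the issue slots.
        heads = [b[0] for b in buffers if b]
        ready = [i for i in heads + iq if all(s in done for s in i.srcs)]
        for ins in ready[:width]:
            if ins in iq:
                iq.remove(ins)
            else:
                next(b for b in buffers if b and b[0] is ins).popleft()
            in_flight[ins.name] = cycle + ins.latency
            if ins.latency >= miss_threshold:
                long_wait.add(ins.name)              # treat as a load miss
        # 4. Dispatch new instructions in program order while the IQ has room.
        while pending and len(iq) < iq_size:
            iq.append(pending.popleft())
    return cycle

def demo_program():
    """One load miss, a four-deep dependence chain, and eight independent instrs."""
    prog = [Instr("L", latency=20)]
    prev = "L"
    for k in range(1, 5):
        prog.append(Instr(f"d{k}", srcs=(prev,)))
        prev = f"d{k}"
    prog += [Instr(f"i{k}") for k in range(1, 9)]
    return prog

if __name__ == "__main__":
    prog = demo_program()
    print("4-entry IQ, no buffers:", simulate(prog, iq_size=4, num_buffers=0))
    print("4-entry IQ, 1 buffer  :", simulate(prog, iq_size=4, num_buffers=1))
    print("8-entry IQ, no buffers:", simulate(prog, iq_size=8, num_buffers=0))
```

On this demo program the three configurations finish in 29, 26, and 26 cycles respectively: the small IQ with one streamline buffer matches the IQ of twice the size because the miss-dependent chain no longer occupies IQ entries. These cycle counts only illustrate the mechanism under the stated assumptions and are unrelated to the speedups reported in the paper.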


Bibliographic details

Source: Microprocessors and Microsystems, 2008, Issue 7, pp. 375-385 (11 pages)
Authors: Hui Wang; Rama Sangireddy
Affiliation: High Performance Dependable Computing Laboratory, Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX 75083, USA
Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords: wide-issue processors; dynamic scheduling; out-of-order processing; long latency instructions; streamline buffers

Similar literature

1. CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution Near In-Order Energy with Near Out-of-Order Performance [J]. Mohammadi Milad, Aamodt Tor M., Dally William J. ACM Transactions on Architecture and Code Optimization. 2017, Issue 4
2. Optimizing Instruction Scheduling through Combined In-Order and O-O-O Execution in SMT Processors [J]. Wang Hui, Sangireddy Rama, Baldawa Sandeep. IEEE Transactions on Parallel and Distributed Systems. 2009, Issue 3
3. Combining Variable Latency Pipeline with Instruction Reuse for Execution Latency Reduction [J]. Toshinori Sato, Itsujiro Arita. Systems and Computers in Japan. 2003, Issue 12
4. Student research poster: Software out-of-order execution for in-order architectures [C]. Kim-Anh Tran. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. 2016
5. Braids: Out-of-order performance with almost in-order complexity [D]. Tseng, Francis. 2007
6. miCloud: A Plug-n-Play Extensible On-Premises Bioinformatics Cloud for Seamless Execution of Complex Next-Generation Sequencing Data Analysis Pipelines [O]. Baekdoo Kim, Thahmina Ali, Changsu Dong
7. Student Research Poster: Software Out-of-Order Execution for In-Order Architectures [O]. Tran, Kim-Anh. 2016
