
Resource management techniques for performance and energy efficiency in multithreaded processors.



Abstract

Microprocessor designers that once favored the aggressive extraction of deeper Instruction-Level Parallelism (ILP) from a single application have recently diverted their attention to architectures that harvest parallelism across multiple threads of control, or Thread-Level Parallelism (TLP). This shift in paradigm has come in light of new design challenges such as larger wire delays, escalating complexity, increasing levels of power dissipation, and higher operating temperatures. One design paradigm that exploits TLP is Simultaneous Multithreading (SMT), where multiple threads of control execute together on a slightly enhanced superscalar processor core and share its key datapath resources. As the number of transistors available on a chip continues to increase in future technologies, it is likely that a higher degree of multithreading will be supported within each processor core. It is therefore important to consider techniques for increasing the efficiency of SMT-enabled cores, and it is precisely the goal of this dissertation to propose and investigate such solutions.

We begin by examining the key shared datapath resources, namely the issue queue (IQ) and the register file (RF). For the IQ, we first propose instruction packing, a technique which opportunistically places two instructions into the same IQ entry provided that each of these instructions has at most one non-ready source operand at the time of dispatch. Instruction packing results in a 40% reduction in IQ power and a 26% reduction in wakeup delay at the cost of only 0.6% of performance for a 4-threaded SMT machine. We then take the ideas behind instruction packing one step further and propose the 2OP_BLOCK scheduler, a scheduling technique that completely disallows the dispatch of instructions with two non-ready sources, thus significantly simplifying the IQ logic.
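The packing condition described above can be sketched as a simple dispatch-time filter. The `Instr` representation, the greedy pairing loop, and the function names are illustrative assumptions for exposition, not the dissertation's actual hardware logic:

```python
from dataclasses import dataclass

@dataclass
class Instr:
    # Number of source operands not yet ready at dispatch time (assumed field).
    non_ready_sources: int

def can_pack(a: Instr, b: Instr) -> bool:
    """Two instructions may share one IQ entry only if each has at most
    one non-ready source operand at the time of dispatch."""
    return a.non_ready_sources <= 1 and b.non_ready_sources <= 1

def dispatch_packed(window: list) -> list:
    """Greedily pair eligible instructions into shared IQ entries; an
    instruction with two non-ready sources occupies a full entry alone."""
    entries, pending = [], None
    for instr in window:
        if instr.non_ready_sources <= 1:
            if pending is None:
                pending = instr          # wait for a packing partner
            else:
                entries.append((pending, instr))  # packed entry
                pending = None
        else:
            entries.append((instr,))     # unpackable: full entry
    if pending is not None:
        entries.append((pending,))       # leftover eligible instruction
    return entries
```

In this sketch a four-instruction window with one two-non-ready instruction occupies three IQ entries instead of four, which is the source of the reported IQ power and wakeup-delay savings.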
This mechanism works well for SMTs because it often allows the same IQ entry to be reused multiple times by instructions with no more than one non-ready source, rather than tying up the entry with an instruction that has two non-ready sources (and which typically spends a longer time in the queue). The 2OP_BLOCK design applied to a 4-threaded SMT with a 32-entry scheduler provides a 33% increase in throughput and a 27% improvement in fairness.

Our next technique addresses the bottleneck associated with another key shared resource of the SMT datapath: the physical register file (RF). We propose a novel mechanism for early deallocation of physical registers that increases register file efficiency and provides higher performance for the same number of registers by exploiting two fundamental trends in multithreaded processor design: (a) increasing memory access latencies, and (b) a relatively higher number of L2 cache misses due to cache-sharing effects. Applied to a 4-threaded SMT machine with 256 integer and 256 floating-point registers (512 registers combined), our technique provides additional gains of 33% (25%) on top of the DCRA mechanism, 38% (26%) on top of the Hill-Climbing technique, and 51% (48%) on top of the ICOUNT fetch policy in terms of throughput IPC (the fairness metric). Our technique is unique in that it incurs no tag re-broadcasts, register re-mappings, associative searches, rename table modifications, or register file checkpoints; it does not require per-register consumer counters and requires no additional storage within the datapath.
Instead, it relies on simple off-the-critical-path logic at the back end of the pipeline to identify early deallocation opportunities and save the values of the early-deallocated registers for precise state reconstruction.

Finally, we show that there are complex interactions between the shared and private per-thread resources in an SMT processor, and that these interactions need to be fully considered to understand the nuances of SMT architectures and to realize the full performance potential of multithreading. We show that without such an understanding, unexpected phenomena may occur. For example, an across-the-board increase in the size of the per-thread reorder buffers (ROBs) often decreases instruction throughput on SMT due to the excessive pressure it puts on shared SMT resources such as the issue queue and the register file. We propose mechanisms, and the underlying ROB organization, to dynamically adapt the number of ROB entries allocated to each thread only when such adaptations do not increase the pressure on the shared datapath resources. Our studies show that such dynamic adaptation of the ROBs yields significant gains on top of the DCRA resource allocation policy in terms of both throughput (54% compared to similarly sized static ROBs and 21% compared to the best-performing static configuration) and fairness (29% and 10%, respectively). We also demonstrate that the performance of adaptive ROBs approaches that of a datapath with an infinite issue queue, thus completely eliminating the size effects of ROB scaling on the shared issue queue and obviating the need for more complex ROB management mechanisms.
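The direction of the adaptive-ROB policy, grow a thread's ROB share only while the shared resources are not under pressure, can be sketched as a one-step controller. The occupancy threshold, step size, and limits below are illustrative assumptions, not the dissertation's tuned values:

```python
def adapt_rob(rob_size, min_size, max_size, step, iq_occupancy, iq_capacity,
              threshold=0.8):
    """Hedged sketch of per-thread ROB adaptation: grow the thread's ROB
    allocation only while the shared issue queue has headroom; shrink it
    back once occupancy crosses the pressure threshold."""
    pressure = iq_occupancy / iq_capacity
    if pressure < threshold and rob_size + step <= max_size:
        return rob_size + step   # more in-flight instructions are safe
    if pressure >= threshold and rob_size - step >= min_size:
        return rob_size - step   # back off to relieve the shared IQ
    return rob_size              # at a limit, or nothing to change
```

The key property the sketch preserves is that a larger ROB is only granted when it cannot translate into extra pressure on the shared issue queue, avoiding the throughput loss that blind across-the-board ROB scaling causes.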

Bibliographic details

  • Author

    Sharkey, Joseph James.;

  • Author's institution

    State University of New York at Binghamton.;

  • Degree grantor: State University of New York at Binghamton.;
  • Subject: Computer Science.
  • Degree: Ph.D.
  • Year: 2006
  • Pages: 195 p.
  • Total pages: 195
  • Format: PDF
  • Language: English
  • CLC classification: Aquaculture, fisheries;
  • Keywords

  • Indexed: 2022-08-17 11:40:46
