Critical branches and lucky loads in Control-Independence architectures.

Abstract

Branch mispredicts have a first-order impact on the performance of integer applications. Control Independence (CI) architectures aim to overlap the penalties of mispredicted branches with useful execution by spawning control-independent work as separate threads. Although control independent, such threads may consume register and memory values produced by preceding threads. This dissertation presents an efficient and complexity-effective mechanism for synchronizing inter-thread register dependences in CI architectures. The mechanism is further extended to synchronize memory dependences by treating store-set IDs as analogous to architectural registers.

The performance of CI architectures is limited by mispredicted branches that have data-flow dependences crossing thread boundaries, called critical branches. Critical branches increase the average mispredict penalties suffered by CI architectures, resulting in a decrease in correct-path instruction bandwidth. I propose hardware mechanisms that alleviate the critical branch problem. First, I modify the register synchronization mechanism to remove false dependences that arise from saves and restores to callee-saved registers. Second, I find that the store-set predictor introduces dependences between loads and stores that alias quite infrequently; I modify the memory synchronization mechanism so that it finds the right balance between synchronization and speculation on a per-load granularity. Finally, I show that other mechanisms, such as a critical branch aware spawn policy, can also alleviate the performance loss from critical branches. As a result of these optimizations, a four-core CI architecture is able to attain a speedup of up to 90% over a single core, although with an aggressive and hard-to-implement memory backend that performs associative searches across the whole queue.

I find that a CI processor that uses the more implementable and widely accepted cache-coherence-based disambiguation and forwarding (CC-DF) suffers a severe slowdown. The CI processor also suffers a large performance hit when using a recently proposed mechanism for disambiguation, called Bulk. In both these cases, most of the performance reduction can be attributed to a small set of instructions across which the memory synchronization mechanism deemed it profitable to speculate, called lucky loads. Because of lucky loads, the performance of CI processors is extremely sensitive to the mechanism used for inter-thread forwarding and disambiguation. With a conservative memory backend like CC-DF or Bulk, the adaptive memory synchronization mechanism is forced to become less speculative and synchronizes a larger fraction of loads, thereby reducing performance.

I perform a thorough analysis of the performance sensitivity of CI processors to disambiguation and forwarding. The insights from this analysis are used to drive the design of hardware mechanisms to perform these two functions that are low in complexity and yet attain high performance. The basic premise behind these mechanisms is to use small caches to perform early disambiguation and forwarding. These caches are not responsible for ensuring correctness; they merely enable high performance in the presence of lucky loads. The caches are backed up by a simple load re-execution mechanism that guarantees correctness. I find that the performance of a CI processor with small (32-entry and 128-entry) structures for disambiguation and forwarding, respectively, is within 10% of global load and store queues in the worst case.

Bibliographic record

  • Author

    Malik, Kshitiz

  • Author affiliation

    University of Illinois at Urbana-Champaign

  • Degree-granting institution: University of Illinois at Urbana-Champaign
  • Subject: Engineering, Electronics and Electrical
  • Degree: Ph.D.
  • Year: 2009
  • Pagination: 148 p.
  • Total pages: 148
  • Original format: PDF
  • Language of text: English