The Load Slice Core microarchitecture


Abstract

Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy-efficiency has largely replaced single-thread performance as the primary concern. Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs by 43% and over 4.7×, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving a 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.
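To make the slice-extraction idea above concrete, here is a minimal, illustrative sketch in Python of iterative backward-slice identification: on each observed execution of a loop body, the producers of registers consumed by memory instructions (or by instructions already marked as address-generating) are added to the slice, so one additional dependency level is discovered per iteration, with no software support. This is a software model of the idea under stated assumptions, not the paper's hardware; the `Insn` class, `find_load_slice` function, and the dictionary/set stand-ins for the hardware tables are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Insn:
    pc: int                                   # static instruction address
    dst: Optional[str]                        # destination register, if any
    srcs: list = field(default_factory=list)  # source register names
    is_mem: bool = False                      # True for loads and stores

def find_load_slice(loop_body, iterations):
    """Return the set of PCs marked as address-generating after
    `iterations` observed executions of the loop body."""
    marked = set()   # stand-in for a table of slice-member instructions
    producer = {}    # stand-in for a register-to-producer-PC table
    for _ in range(iterations):
        for insn in loop_body:
            if insn.is_mem or insn.pc in marked:
                # This instruction feeds an address computation, so the
                # producers of its source registers join the slice too.
                for reg in insn.srcs:
                    if reg in producer:
                        marked.add(producer[reg])
            if insn.dst is not None:
                producer[insn.dst] = insn.pc
    return marked

# Tiny index-chasing loop body (hypothetical pseudo-RISC encoding):
loop = [
    Insn(pc=0, dst="r1", srcs=["r1"]),               # i    = i + 1
    Insn(pc=1, dst="r2", srcs=["r1", "r5"]),         # addr = base + i*8 (base in r5, set before the loop)
    Insn(pc=2, dst="r3", srcs=["r2"], is_mem=True),  # val  = load [addr]
    Insn(pc=3, dst="r4", srcs=["r4", "r3"]),         # sum  = sum + val  (consumer; never in the slice)
]
print(find_load_slice(loop, iterations=1))  # {1}: direct address producer found
print(find_load_slice(loop, iterations=2))  # {0, 1}: one more dependency level per pass
```

In the sketch, the instruction at pc=3 is never marked, which reflects the selective nature of the approach: only the address-generating slice is allowed to bypass stalled instructions, while ordinary consumers stay in the main in-order pipeline.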