Achieving Superscalar Performance without Superscalar Overheads - A Dataflow Compiler IR for Custom Computing

Abstract

The difficulty of effectively parallelizing code for multicore processors, combined with the end of threshold voltage scaling, has resulted in the problem of 'Dark Silicon', severely limiting performance scaling despite Moore's Law. To address dark silicon, not only must we drastically improve the energy efficiency of computation, but due to Amdahl's Law, we must do so without compromising sequential performance. Designers increasingly utilize custom hardware to dramatically improve both efficiency and performance in increasingly heterogeneous architectures. Unfortunately, while it efficiently accelerates numeric, data-parallel applications, custom hardware often exhibits poor performance on sequential code, so complex, power-hungry superscalar processors must still be utilized. This paper addresses the problem of improving sequential performance in custom hardware by (a) switching from a statically scheduled to a dynamically scheduled (dataflow) execution model, and (b) developing a new compiler IR for high-level synthesis that enables aggressive exposition of ILP even in the presence of complex control flow. This new IR is directly implemented as a static dataflow graph in hardware by our high-level synthesis tool-chain, and shows an average speedup of 1.13 times over equivalent hardware generated using LegUp, an existing HLS tool. In addition, our new IR allows us to further trade area & energy for performance, increasing the average speedup to 1.55 times, through loop unrolling, with a peak speedup of 4.05 times. Our custom hardware is able to approach the sequential cycle-counts of an Intel Nehalem Core i7 superscalar processor, while consuming on average only 0.25 times the energy of an in-order Altera Nios II/f processor.
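To make the dataflow execution model mentioned in the abstract concrete, the following is a minimal Python sketch of the dataflow firing rule: an operation starts as soon as all of its input tokens are available, rather than at a slot fixed by a global schedule. It is not the paper's IR or tool-chain. The `Node` class, the operation names, and the latencies are hypothetical, and the model assumes an acyclic graph with no resource limits.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    latency: int  # cycles the operation takes (hypothetical values)
    inputs: list = field(default_factory=list)  # names of producer nodes


def dataflow_finish_times(nodes):
    """Return {name: cycle at which the node's result token is available}.

    Each node fires as soon as all of its input tokens have arrived
    (the dataflow firing rule). Assumes an acyclic graph and unlimited
    functional units, which real hardware would not have.
    """
    by_name = {n.name: n for n in nodes}
    done = {}

    def finish(name):
        if name not in done:
            node = by_name[name]
            # A node with no inputs can start at cycle 0.
            start = max((finish(p) for p in node.inputs), default=0)
            done[name] = start + node.latency
        return done[name]

    for n in nodes:
        finish(n.name)
    return done


def sequential_cycles(nodes):
    """Cycle count if operations run strictly one after another."""
    return sum(n.latency for n in nodes)


# Hypothetical graph for the expression (a * b) + c with three loads.
graph = [
    Node("load_a", 2),
    Node("load_b", 2),
    Node("mul", 3, ["load_a", "load_b"]),
    Node("load_c", 2),
    Node("add", 1, ["mul", "load_c"]),
]

times = dataflow_finish_times(graph)
dataflow_total = max(times.values())
sequential_total = sequential_cycles(graph)
print(dataflow_total, sequential_total)
```

In this toy graph the three loads overlap, so the dataflow critical path is shorter than the strictly sequential cycle count. The numbers only illustrate how independent operations overlap and say nothing about the speedups reported in the abstract.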
