Tradeoffs in Designing Massively Parallel Accelerator Architectures

Abstract

There is a large, emerging, and commercially relevant class of applications that stands to be enabled by a significant increase in parallel computing throughput. Moreover, continued scaling of semiconductor technology allows the creation of architectures with tremendous throughput on a single chip. In this thesis, we examine the confluence of these emerging single-chip accelerators and the applications they enable. We examine the tradeoffs associated with accelerator architectures, working our way down the abstraction hierarchy of computing, starting at the application level and concluding with the physical design of the circuits.

Research into accelerator architectures is hampered by the lack of standardized, readily available benchmarks. Among the applications that accelerators stand to enable is a class we refer to as visualization, interaction, and simulation (VIS). These applications are ideally suited for accelerators because of their parallelizability and demand for high throughput. We present VISBench, a benchmark suite to serve as an experimental proxy for VIS applications. VISBench contains a sampling of applications and application kernels from traditional visual computing areas such as graphics rendering and video encoding. It also contains a sampling of emerging application areas, such as computer vision and physics simulation, which are expected to drive the development of future accelerator architectures.

We use VISBench to examine some important high-level decisions for an accelerator architecture. We propose a methodology to evaluate performance tradeoffs against chip area. We propose a memory system based on a cache-incoherent shared address space, along with mechanisms to provide synchronization and communication. We also examine GPU-style SIMD execution and find that a MIMD architecture is necessary to provide strong performance per area for some applications.

We analyze area versus performance tradeoffs in architecting the individual cores. We find that a design made of small, simple cores achieves much higher throughput than a general-purpose uniprocessor. Further, we find that a limited amount of support for instruction-level parallelism (ILP) within each core aids overall performance. We find that fine-grained multithreading improves performance, but only up to a point. We find that vector ALUs for SIMD instruction sets provide a poor performance-to-area ratio.

We propose a methodology for performing an integrated optimization of both the micro-architecture and the physical circuit design of the cores and caches. In our approach, we use statistical sampling of the design space to evaluate the performance of the micro-architecture, and RTL synthesis to characterize the area, power, and delay of the underlying circuits. This integrated methodology enables a much more powerful analysis of the performance-area and performance-power tradeoffs for the low-level micro-architecture. We use this methodology to find the optimal design points for an accelerator architecture under area constraints and power constraints. Our results indicate that more complex architectures scale well in terms of performance per area, but that the addition of a power constraint favors simpler architectures.

Bibliographic details

  • Author

    Mahesri Aqeel;

  • Author affiliation
  • Year 2009
  • Total pages
  • Original format PDF
  • Language of text English
  • Chinese Library Classification
