
Improving Scalability of Chip-MultiProcessors with Many HW ACCelerators



Abstract

Breakthrough streaming applications such as virtual reality, augmented reality, autonomous vehicles, and multimedia demand high-performance, power-efficient computing. In response to this ever-increasing demand, manufacturers look beyond the parallelism available in Chip Multi-Processors (CMPs) and toward application-specific designs. In this regard, ACCelerator (ACC)-based heterogeneous CMPs (ACMPs) have emerged as a promising platform.

An ACMP combines application-specific HW ACCelerators (ACCs) with General Purpose Processor(s) (GPPs) on a single chip. ACCs are customized to provide high-performance, power-efficient computing for specific compute-intensive functions, while the GPP(s) run the remaining functions and control the whole system. In ACMP platforms, ACCs achieve their performance and power benefits at the expense of reduced flexibility and generality across workloads. Manufacturers must therefore deploy several ACCs to target a diverse set of workloads within a given application domain.

However, our observations show that conventional ACMP architectures with many ACCs have scalability limitations. The ACCs' processing-power benefits can be overshadowed by bottlenecks on shared resources: the processor core(s), the communication fabric/DMA, and on-chip memory. The primary source of these bottlenecks is the ACCs' data-access and orchestration load. Because the semantics for communicating with ACCs are only loosely defined, and because ACMPs rely on general-purpose platform architectures, these resource bottlenecks hamper performance.

This dissertation explores and alleviates the scalability limitations of ACMPs. To this end, it first proposes an analytical model to holistically explore how bottlenecks emerge on shared resources as the number of ACCs increases.
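The intuition behind such a bottleneck analysis can be sketched in a few lines. This is an illustrative toy model only, not the dissertation's actual formulation; the function name, rates, and bandwidth figures are assumptions chosen for the example:

```python
# Toy model: N ACCs contend for a shared DMA/fabric of fixed bandwidth.
# Once aggregate data-access demand exceeds the fabric, per-ACC benefit
# collapses, which is the scalability limitation described above.
def effective_throughput(n_accs, acc_rate, shared_bw):
    """Per-ACC achieved rate (MB/s) when each ACC can process acc_rate
    MB/s in isolation but all ACCs share shared_bw MB/s of fabric/DMA."""
    demanded = n_accs * acc_rate          # aggregate data-access demand
    delivered = min(demanded, shared_bw)  # fabric saturates at shared_bw
    return delivered / n_accs             # achieved rate per ACC

# Per-ACC throughput with a 400 MB/s fabric and 100 MB/s ACCs:
rates = [effective_throughput(n, acc_rate=100.0, shared_bw=400.0)
         for n in (1, 2, 4, 8, 16)]
# Up to 4 ACCs each run at full rate; beyond that, adding ACCs yields
# no additional aggregate throughput -- the shared fabric is the bottleneck.
```

In this simplified view the knee of the curve sits where `n_accs * acc_rate == shared_bw`; the dissertation's model additionally accounts for processor orchestration load and on-chip memory contention.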
It then proposes ACMPerf, an analytical model that captures the impact of these resource bottlenecks on the achievable ACC benefits.

Next, to open a path toward more scalable integration of ACCs, the dissertation identifies and formalizes ACC communication semantics. The semantics comprise four primary aspects: data access, synchronization, data granularity, and data marshalling.

Building on these identified semantics, and improving upon conventional ACMP architectures, the dissertation proposes a novel architecture of Transparent Self-Synchronizing ACCs (TSS). TSS efficiently realizes the identified communication semantics for the direct ACC-to-ACC connections that often occur in streaming applications. TSS adds autonomy to ACCs to locally handle the semantic aspects of data granularity, data marshalling, and synchronization, and it exploits a local interconnect among ACCs to address the aspect of data access. Because TSS lets ACCs self-synchronize and self-orchestrate one another independent of the processor, it enables the finest data granularity and reduces pressure on shared memory. TSS also uses a local, reconfigurable interconnect for direct data transfer among ACCs without occupying the DMA and communication fabric.

By reducing the overhead of direct ACC-to-ACC connections, TSS delivers more of the ACCs' benefits than conventional ACMP architectures: up to 130x higher throughput and 209x lower energy, as a result of up to 78x reduction in the load imposed on shared resources.
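The self-synchronization idea can be sketched in software, with ACCs chained by local links that fire purely on data availability, so no processor orchestration and no shared-memory round trip sits between stages. This is a hypothetical illustration of the concept, not the TSS hardware design; the class and its fields are invented for the example:

```python
from collections import deque

class SelfSyncACC:
    """Toy software analogue of a self-synchronizing ACC: it consumes
    tokens from a local inbox and pushes results directly to a successor
    ACC, firing whenever enough data is locally available."""
    def __init__(self, fn, granularity=1):
        self.fn = fn                      # the accelerated function
        self.granularity = granularity    # tokens consumed per firing
        self.inbox = deque()              # local ACC-to-ACC link
        self.successor = None

    def push(self, token):
        self.inbox.append(token)
        self.fire()                       # synchronization is purely local

    def fire(self):
        # Fire at the finest granularity the stage allows -- no host core
        # schedules this, and no DMA/shared-memory transfer is involved.
        while len(self.inbox) >= self.granularity:
            chunk = [self.inbox.popleft() for _ in range(self.granularity)]
            out = self.fn(chunk)
            if self.successor:
                self.successor.push(out)  # direct ACC-to-ACC transfer

# A two-stage streaming pipeline: scale each token, then sum pairs.
sink = []
acc2 = SelfSyncACC(lambda c: sink.append(sum(c)), granularity=2)
acc1 = SelfSyncACC(lambda c: 2 * c[0], granularity=1)
acc1.successor = acc2
for x in range(4):
    acc1.push(x)      # tokens stream through both ACCs autonomously
# sink == [2, 10]: pairs (0,2) and (4,6) summed as they arrive
```

In a conventional ACMP, each stage boundary would instead cost a DMA transfer through shared memory plus a processor interrupt for orchestration; removing those per-transfer costs is the source of the throughput, energy, and shared-resource-load improvements reported above.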

Bibliographic details

  • Author

    Teimouri, Nasibeh

  • Author affiliation

    Northeastern University

  • Degree-granting institution: Northeastern University
  • Subject: Computer engineering
  • Degree: Ph.D.
  • Year: 2017
  • Pagination: 131 p.
  • Total pages: 131
  • Format: PDF
  • Language: English
