
Improving Scalability of Chip-MultiProcessors with Many HW ACCelerators



Abstract

Breakthrough streaming applications such as virtual reality, augmented reality, autonomous vehicles, and multimedia demand high-performance, power-efficient computing. In response to this ever-increasing demand, manufacturers look beyond the parallelism available in Chip Multi-Processors (CMPs) and toward application-specific designs. In this regard, ACCelerator (ACC)-based heterogeneous CMPs (ACMPs) have emerged as a promising platform.

An ACMP combines application-specific HW ACCelerators (ACCs) with General Purpose Processor(s) (GPPs) on a single chip. ACCs are customized to provide high-performance, power-efficient computing for specific compute-intensive functions, while the GPP(s) run the remaining functions and control the whole system. In ACMP platforms, ACCs achieve their performance and power benefits at the expense of reduced flexibility and generality across workloads. Manufacturers must therefore deploy several ACCs to target a diverse set of workloads within a given application domain.

However, our observations show that conventional ACMP architectures with many ACCs have scalability limitations. The ACCs' processing-power benefits can be overshadowed by bottlenecks on shared resources: the processor core(s), the communication fabric/DMA, and on-chip memory. The primary source of these bottlenecks is the ACCs' data-access and orchestration load. Because the semantics for communicating with ACCs are only loosely defined, and because ACMPs rely on general-purpose platform architectures, these resource bottlenecks hamper performance.

This dissertation explores and alleviates the scalability limitations of ACMPs. To this end, it first proposes an analytical model to holistically explore how bottlenecks emerge on shared resources as the number of ACCs increases.
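The intuition behind such a bottleneck analysis can be sketched in a few lines. This is an illustrative toy model only, not the dissertation's actual formulation; the function name, rates, and bandwidth figures are assumptions chosen for the example:

```python
# Toy model: N ACCs contend for a shared DMA/fabric of fixed bandwidth.
# Once aggregate data-access demand exceeds the fabric, per-ACC benefit
# collapses, which is the scalability limitation described above.
def effective_throughput(n_accs, acc_rate, shared_bw):
    """Per-ACC achieved rate (MB/s) when each ACC can process acc_rate
    MB/s in isolation but all ACCs share shared_bw MB/s of fabric/DMA."""
    demanded = n_accs * acc_rate          # aggregate data-access demand
    delivered = min(demanded, shared_bw)  # fabric saturates at shared_bw
    return delivered / n_accs             # achieved rate per ACC

# Per-ACC throughput with a 400 MB/s fabric and 100 MB/s ACCs:
rates = [effective_throughput(n, acc_rate=100.0, shared_bw=400.0)
         for n in (1, 2, 4, 8, 16)]
# Up to 4 ACCs each run at full rate; beyond that, adding ACCs yields
# no additional aggregate throughput -- the shared fabric is the bottleneck.
```

In this simplified view the knee of the curve sits where `n_accs * acc_rate == shared_bw`; the dissertation's model additionally accounts for processor orchestration load and on-chip memory contention.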
It then proposes ACMPerf, an analytical model that captures the impact of these resource bottlenecks on the achievable ACC benefits.

Next, to open a path toward more scalable integration of ACCs, the dissertation identifies and formalizes ACC communication semantics. The semantics comprise four primary aspects: data access, synchronization, data granularity, and data marshalling.

Building on these identified semantics, and improving upon conventional ACMP architectures, the dissertation proposes a novel architecture of Transparent Self-Synchronizing ACCs (TSS). TSS efficiently realizes the identified communication semantics for the direct ACC-to-ACC connections that often occur in streaming applications. TSS adds autonomy to ACCs to locally handle the semantic aspects of data granularity, data marshalling, and synchronization, and it exploits a local interconnect among ACCs to address the aspect of data access. Because TSS lets ACCs self-synchronize and self-orchestrate one another independent of the processor, it enables the finest data granularity and reduces pressure on shared memory. TSS also uses a local, reconfigurable interconnect for direct data transfer among ACCs without occupying the DMA and communication fabric.

By reducing the overhead of direct ACC-to-ACC connections, TSS delivers more of the ACCs' benefits than conventional ACMP architectures: up to 130x higher throughput and 209x lower energy, as a result of up to 78x reduction in the load imposed on shared resources.
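The self-synchronization idea can be sketched in software, with ACCs chained by local links that fire purely on data availability, so no processor orchestration and no shared-memory round trip sits between stages. This is a hypothetical illustration of the concept, not the TSS hardware design; the class and its fields are invented for the example:

```python
from collections import deque

class SelfSyncACC:
    """Toy software analogue of a self-synchronizing ACC: it consumes
    tokens from a local inbox and pushes results directly to a successor
    ACC, firing whenever enough data is locally available."""
    def __init__(self, fn, granularity=1):
        self.fn = fn                      # the accelerated function
        self.granularity = granularity    # tokens consumed per firing
        self.inbox = deque()              # local ACC-to-ACC link
        self.successor = None

    def push(self, token):
        self.inbox.append(token)
        self.fire()                       # synchronization is purely local

    def fire(self):
        # Fire at the finest granularity the stage allows -- no host core
        # schedules this, and no DMA/shared-memory transfer is involved.
        while len(self.inbox) >= self.granularity:
            chunk = [self.inbox.popleft() for _ in range(self.granularity)]
            out = self.fn(chunk)
            if self.successor:
                self.successor.push(out)  # direct ACC-to-ACC transfer

# A two-stage streaming pipeline: scale each token, then sum pairs.
sink = []
acc2 = SelfSyncACC(lambda c: sink.append(sum(c)), granularity=2)
acc1 = SelfSyncACC(lambda c: 2 * c[0], granularity=1)
acc1.successor = acc2
for x in range(4):
    acc1.push(x)      # tokens stream through both ACCs autonomously
# sink == [2, 10]: pairs (0,2) and (4,6) summed as they arrive
```

In a conventional ACMP, each stage boundary would instead cost a DMA transfer through shared memory plus a processor interrupt for orchestration; removing those per-transfer costs is the source of the throughput, energy, and shared-resource-load improvements reported above.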

Bibliographic details

  • Author

    Teimouri, Nasibeh

  • Author affiliation

    Northeastern University

  • Degree-granting institution: Northeastern University
  • Subject: Computer engineering
  • Degree: Ph.D.
  • Year: 2017
  • Pagination: 131 p.
  • Total pages: 131
  • Format: PDF
  • Language: English
