International Conference on Application-specific Systems, Architectures and Processors

Maestro: A Memory-on-Logic Architecture for Coordinated Parallel Use of Many Systolic Arrays



Abstract

We present the Maestro memory-on-logic 3D-IC architecture for coordinated parallel use of a plurality of systolic arrays (SAs) in performing deep neural network (DNN) inference. Maestro reduces the under-utilization common for a single large SA by allowing parallel use of many smaller SAs on DNN weight matrices of varying shapes and sizes. In order to buffer intermediate results in memory blocks (MBs) and provide coordinated high-bandwidth communication between SAs and MBs in transferring weights and results, Maestro employs three innovations. (1) An SA on the logic die can access its corresponding MB on the memory die over a short distance using 3D-IC interconnects, (2) through an efficient switch based on H-trees, an SA can access any MB with low latency, and (3) the switch can combine partial results from SAs in an elementwise fashion before writing back to a destination MB. We describe the Maestro architecture, including a circuit and layout design, detail scheduling of the switch, analyze system performance for real-time inference applications using input with batch size equal to one, and showcase applications for deep learning inference, with ShiftNet for computer vision and recent Transformer models for natural language processing. For the same total number of systolic cells, Maestro, with multiple smaller SAs, leads to 16x and 12x latency improvements over a single large SA on ShiftNet and Transformer, respectively. Compared to a floating-point GPU implementation of ShiftNet and Transformer, a baseline Maestro system with 4,096 SAs (each with 8x8 systolic cells) provides significant latency improvements of 30x and 47x, respectively.
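The abstract describes a coordination pattern in which many small SAs each compute a partial product over a tile of the weight matrix, and the H-tree switch sums those partial results elementwise before writing them to a destination MB. The sketch below illustrates that dataflow in NumPy. It is a minimal model under stated assumptions, not the paper's implementation: the 8x8 tile size follows the baseline configuration quoted above, and names such as sa_tile_matmul and maestro_style_matmul are hypothetical.

import numpy as np

SA_DIM = 8  # baseline configuration in the abstract: each SA has 8x8 systolic cells

def sa_tile_matmul(w_tile, x_tile):
    # Stand-in for one small SA computing a partial product on one weight tile.
    # A real SA would stream x_tile through its cells; here we only model the output.
    return w_tile @ x_tile

def maestro_style_matmul(W, x):
    # Tile W across many small SAs. SAs that share the same output rows produce
    # partial results, which the switch combines elementwise before writing the
    # sum back to a destination memory block (MB).
    m, k = W.shape
    # Pad so both dimensions are multiples of the SA size.
    m_pad = -m % SA_DIM
    k_pad = -k % SA_DIM
    Wp = np.pad(W, ((0, m_pad), (0, k_pad)))
    xp = np.pad(x, (0, k_pad))

    y = np.zeros(Wp.shape[0])
    for i in range(0, Wp.shape[0], SA_DIM):        # tiles of output rows
        partials = []
        for j in range(0, Wp.shape[1], SA_DIM):    # one SA per weight tile
            partials.append(sa_tile_matmul(Wp[i:i + SA_DIM, j:j + SA_DIM],
                                           xp[j:j + SA_DIM]))
        # Elementwise combination of partial results (the switch's role)
        y[i:i + SA_DIM] = np.sum(partials, axis=0)
    return y[:m]

# Example: a weight matrix whose shape is not a multiple of the SA size.
W = np.random.randn(20, 30)
x = np.random.randn(30)
assert np.allclose(maestro_style_matmul(W, x), W @ x)

Tiling along the inner dimension is what creates partial results that must be combined; tiling along the output dimension only assigns disjoint output rows to different SAs, which is why just the inner-loop results are summed.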
