International Conference on Application-specific Systems, Architectures and Processors

Maestro: A Memory-on-Logic Architecture for Coordinated Parallel Use of Many Systolic Arrays



Abstract

We present the Maestro memory-on-logic 3D-IC architecture for coordinated parallel use of a plurality of systolic arrays (SAs) in performing deep neural network (DNN) inference. Maestro reduces the under-utilization common for a single large SA by allowing parallel use of many smaller SAs on DNN weight matrices of varying shapes and sizes. In order to buffer intermediate results in memory blocks (MBs) and provide coordinated high-bandwidth communication between SAs and MBs in transferring weights and results, Maestro employs three innovations. (1) An SA on the logic die can access its corresponding MB on the memory die over a short distance using 3D-IC interconnects, (2) through an efficient switch based on H-trees, an SA can access any MB with low latency, and (3) the switch can combine partial results from SAs in an elementwise fashion before writing back to a destination MB. We describe the Maestro architecture, including a circuit and layout design, detail scheduling of the switch, analyze system performance for real-time inference applications using input with batch size equal to one, and showcase applications for deep learning inference, with ShiftNet for computer vision and recent Transformer models for natural language processing. For the same total number of systolic cells, Maestro, with multiple smaller SAs, leads to 16x and 12x latency improvements over a single large SA on ShiftNet and Transformer, respectively. Compared to a floating-point GPU implementation of ShiftNet and Transformer, a baseline Maestro system with 4,096 SAs (each with 8x8 systolic cells) provides significant latency improvements of 30x and 47x, respectively.
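The abstract describes a coordination pattern in which many small SAs each compute a partial product over a tile of the weight matrix, and the H-tree switch sums those partial results elementwise before writing them to a destination MB. The sketch below illustrates that dataflow in NumPy. It is a minimal model under stated assumptions, not the paper's implementation: the 8x8 tile size follows the baseline configuration quoted above, and names such as sa_tile_matmul and maestro_style_matmul are hypothetical.

import numpy as np

SA_DIM = 8  # baseline configuration in the abstract: each SA has 8x8 systolic cells

def sa_tile_matmul(w_tile, x_tile):
    # Stand-in for one small SA computing a partial product on one weight tile.
    # A real SA would stream x_tile through its cells; here we only model the output.
    return w_tile @ x_tile

def maestro_style_matmul(W, x):
    # Tile W across many small SAs. SAs that share the same output rows produce
    # partial results, which the switch combines elementwise before writing the
    # sum back to a destination memory block (MB).
    m, k = W.shape
    # Pad so both dimensions are multiples of the SA size.
    m_pad = -m % SA_DIM
    k_pad = -k % SA_DIM
    Wp = np.pad(W, ((0, m_pad), (0, k_pad)))
    xp = np.pad(x, (0, k_pad))

    y = np.zeros(Wp.shape[0])
    for i in range(0, Wp.shape[0], SA_DIM):        # tiles of output rows
        partials = []
        for j in range(0, Wp.shape[1], SA_DIM):    # one SA per weight tile
            partials.append(sa_tile_matmul(Wp[i:i + SA_DIM, j:j + SA_DIM],
                                           xp[j:j + SA_DIM]))
        # Elementwise combination of partial results (the switch's role)
        y[i:i + SA_DIM] = np.sum(partials, axis=0)
    return y[:m]

# Example: a weight matrix whose shape is not a multiple of the SA size.
W = np.random.randn(20, 30)
x = np.random.randn(30)
assert np.allclose(maestro_style_matmul(W, x), W @ x)

Tiling along the inner dimension is what creates partial results that must be combined; tiling along the output dimension only assigns disjoint output rows to different SAs, which is why just the inner-loop results are summed.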
