ACM/IEEE Annual International Symposium on Computer Architecture

REDUCT: Keep it Close, Keep it Cool! : Efficient Scaling of DNN Inference on Multi-core CPUs with Near-Cache Compute



Abstract

Deep Neural Networks (DNN) are used in a variety of applications and services. With the evolving nature of DNNs, the race to build optimal hardware (both in the datacenter and at the edge) continues. General-purpose multi-core CPUs offer uniquely attractive advantages for DNN inference at both the datacenter [60] and the edge [71]. Most of the CPU pipeline design complexity targets general-purpose single-thread performance, and is overkill for the relatively simpler, but still hugely important, data-parallel DNN inference workloads. Addressing this disparity efficiently can enable both raw performance scaling and overall performance/Watt improvements for multi-core CPU DNN inference.

We present REDUCT, where we build innovative solutions that bypass traditional CPU resources which impact DNN inference power and limit its performance. Fundamentally, REDUCT's "Keep it close" policy enables consecutive pieces of work to be executed close to each other. REDUCT enables instruction delivery/decode close to execution and instruction execution close to data. Simple ISA extensions encode the fixed-iteration-count loop-y workload behavior, enabling an effective bypass of many power-hungry front-end stages of the wide Out-of-Order (OoO) CPU pipeline. Per-core performance scales efficiently by distributing light-weight tensor compute near all caches in a multi-level cache hierarchy. This maximizes the cumulative utilization of the existing architectural bandwidth resources in the system and minimizes movement of data.

Across a number of DNN models, REDUCT achieves a 2.3× increase in convolution performance/Watt with a 2× to 3.94× scaling in raw performance. Similarly, REDUCT achieves a 1.8× increase in inner-product performance/Watt with 2.8× scaling in performance. REDUCT performance/power scaling is achieved with no increase in cache capacity or bandwidth and a mere 2.63% increase in area. Crucially, REDUCT operates entirely within the CPU programming and memory model, simplifying software development, while achieving performance similar to or better than state-of-the-art Domain Specific Accelerators (DSA) for DNN inference, providing fresh design choices in the AI era.
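To make the "fixed-iteration-count loop-y workload" property concrete, below is a minimal plain-C sketch of a DNN inner-product layer. It is illustrative only: the abstract does not specify REDUCT's ISA extensions, and the kernel sizes M and K here are hypothetical. The point is that the trip counts are statically known, so a conventional wide OoO front end re-fetches and re-decodes the same short instruction sequence on every iteration; this is the redundant front-end work the abstract says REDUCT's ISA extensions encode once and bypass.

```c
/* Illustrative sketch (not REDUCT's ISA): a fixed-iteration-count,
 * data-parallel inner-product kernel of the kind DNN inference runs.
 * M and K are hypothetical layer sizes, known at compile time, so the
 * loop body and its trip counts are fully determined before execution. */
#include <stdio.h>

#define M 64    /* number of output neurons (hypothetical) */
#define K 256   /* number of input features (hypothetical) */

static float weights[M][K];
static float input[K];
static float output[M];

void inner_product(void)
{
    for (int m = 0; m < M; m++) {       /* fixed trip count M */
        float acc = 0.0f;
        for (int k = 0; k < K; k++)     /* fixed trip count K */
            acc += weights[m][k] * input[k];
        output[m] = acc;
    }
}

int main(void)
{
    /* Dummy data so the sketch runs end to end. */
    for (int k = 0; k < K; k++)
        input[k] = 1.0f;
    for (int m = 0; m < M; m++)
        for (int k = 0; k < K; k++)
            weights[m][k] = 0.5f;

    inner_product();
    printf("output[0] = %f\n", output[0]);  /* 256 * 0.5 = 128.0 */
    return 0;
}
```

On a conventional OoO core, each of the M×K iterations above flows through fetch, decode, and allocation even though nothing about the instruction stream changes between iterations; the abstract's claim is that encoding such loops explicitly lets those front-end stages be skipped, while placing the multiply-accumulate work near the caches that hold `weights` and `input` minimizes data movement.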