ACM/IEEE Annual International Symposium on Computer Architecture

REDUCT: Keep it Close, Keep it Cool! : Efficient Scaling of DNN Inference on Multi-core CPUs with Near-Cache Compute



Abstract

Deep Neural Networks (DNN) are used in a variety of applications and services. With the evolving nature of DNNs, the race to build optimal hardware (both in the datacenter and at the edge) continues. General-purpose multi-core CPUs offer uniquely attractive advantages for DNN inference at both the datacenter [60] and the edge [71]. Most of the CPU pipeline design complexity targets general-purpose single-thread performance, and is overkill for the relatively simpler, but still hugely important, data-parallel DNN inference workloads. Addressing this disparity efficiently can enable both raw performance scaling and overall performance/Watt improvements for multi-core CPU DNN inference.

We present REDUCT, where we build innovative solutions that bypass traditional CPU resources which impact DNN inference power and limit its performance. Fundamentally, REDUCT's "Keep it close" policy enables consecutive pieces of work to be executed close to each other. REDUCT enables instruction delivery/decode close to execution and instruction execution close to data. Simple ISA extensions encode the fixed-iteration-count loop-y workload behavior, enabling an effective bypass of many power-hungry front-end stages of the wide Out-of-Order (OoO) CPU pipeline. Per-core performance scales efficiently by distributing light-weight tensor compute near all caches in a multi-level cache hierarchy. This maximizes the cumulative utilization of the existing architectural bandwidth resources in the system and minimizes movement of data.

Across a number of DNN models, REDUCT achieves a 2.3× increase in convolution performance/Watt with a 2× to 3.94× scaling in raw performance. Similarly, REDUCT achieves a 1.8× increase in inner-product performance/Watt with 2.8× scaling in performance. REDUCT performance/power scaling is achieved with no increase in cache capacity or bandwidth and a mere 2.63% increase in area. Crucially, REDUCT operates entirely within the CPU programming and memory model, simplifying software development, while achieving performance similar to or better than state-of-the-art Domain Specific Accelerators (DSA) for DNN inference, providing fresh design choices in the AI era.
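To make the "fixed-iteration-count loop-y workload" property concrete, below is a minimal plain-C sketch of a DNN inner-product layer. It is illustrative only: the abstract does not specify REDUCT's ISA extensions, and the kernel sizes M and K here are hypothetical. The point is that the trip counts are statically known, so a conventional wide OoO front end re-fetches and re-decodes the same short instruction sequence on every iteration; this is the redundant front-end work the abstract says REDUCT's ISA extensions encode once and bypass.

```c
/* Illustrative sketch (not REDUCT's ISA): a fixed-iteration-count,
 * data-parallel inner-product kernel of the kind DNN inference runs.
 * M and K are hypothetical layer sizes, known at compile time, so the
 * loop body and its trip counts are fully determined before execution. */
#include <stdio.h>

#define M 64    /* number of output neurons (hypothetical) */
#define K 256   /* number of input features (hypothetical) */

static float weights[M][K];
static float input[K];
static float output[M];

void inner_product(void)
{
    for (int m = 0; m < M; m++) {       /* fixed trip count M */
        float acc = 0.0f;
        for (int k = 0; k < K; k++)     /* fixed trip count K */
            acc += weights[m][k] * input[k];
        output[m] = acc;
    }
}

int main(void)
{
    /* Dummy data so the sketch runs end to end. */
    for (int k = 0; k < K; k++)
        input[k] = 1.0f;
    for (int m = 0; m < M; m++)
        for (int k = 0; k < K; k++)
            weights[m][k] = 0.5f;

    inner_product();
    printf("output[0] = %f\n", output[0]);  /* 256 * 0.5 = 128.0 */
    return 0;
}
```

On a conventional OoO core, each of the M×K iterations above flows through fetch, decode, and allocation even though nothing about the instruction stream changes between iterations; the abstract's claim is that encoding such loops explicitly lets those front-end stages be skipped, while placing the multiply-accumulate work near the caches that hold `weights` and `input` minimizes data movement.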