IEEE Transactions on Computers

A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets


Abstract

Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: (i) a loose coupling of RISC-V cores and NTX co-processors reducing offloading overhead by 7× over previously published results; (ii) an optimized IEEE 754 compliant data path for fast high-precision convolutions and gradient propagation; (iii) evaluation of near-memory computing with NTX embedded into residual area on the Logic Base die of a Hybrid Memory Cube; and (iv) a scaling analysis to meshes of HMCs in a data center scenario. We demonstrate a 2.7× energy efficiency improvement of NTX over contemporary GPUs at 4.4× less silicon area, and a compute performance of 1.2 Tflop/s for training large state-of-the-art networks with full floating-point precision. At the data center scale, a mesh of NTX achieves above 95 percent parallel and energy efficiency, while providing 2.1× energy savings or 3.1× performance improvement over a GPU-based system.
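The "key computational patterns in gradient-based training" that the abstract refers to can be illustrated with a minimal sketch (not taken from the paper): every training step consists of a forward pass of dense multiply-accumulates, a backward pass computing weight gradients, and an in-place parameter update. All names, shapes, and the single-layer model below are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of one gradient-based training loop on a single linear
# layer. Near-memory engines of the kind the abstract describes target
# exactly these dense MAC-heavy forward/backward/update kernels.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))   # mini-batch of 8 inputs
y = rng.standard_normal((8, 2))   # targets
W = rng.standard_normal((4, 2))   # weights of one linear layer
lr = 0.1                          # SGD learning rate

initial_loss = float(np.mean((x @ W - y) ** 2))

for _ in range(100):
    pred = x @ W                  # forward pass: dense multiply-accumulate
    err = pred - y                # error at the output
    grad = x.T @ err / len(x)     # backward pass: gradient w.r.t. weights
    W -= lr * grad                # SGD update, applied in place

loss = float(np.mean((x @ W - y) ** 2))
```

In a convolutional network the same three phases appear per layer, with the matrix products replaced by convolutions and their transposed counterparts for gradient propagation.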
