IEEE Transactions on Computers

A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets

Abstract

Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: (i) a loose coupling of RISC-V cores and NTX co-processors reducing offloading overhead by 7 x over previously published results; (ii) an optimized IEEE 754 compliant data path for fast high-precision convolutions and gradient propagation; (iii) evaluation of near-memory computing with NTX embedded into residual area on the Logic Base die of a Hybrid Memory Cube; and (iv) a scaling analysis to meshes of HMCs in a data center scenario. We demonstrate a 2.7 x energy efficiency improvement of NTX over contemporary GPUs at 4.4 x less silicon area, and a compute performance of 1.2 Tflop/s for training large state-of-the-art networks with full floating-point precision. At the data center scale, a mesh of NTX achieves above 95 percent parallel and energy efficiency, while providing 2.1 x energy savings or 3.1 x performance improvement over a GPU-based system.
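To make the workload concrete, the sketch below illustrates the multiply-accumulate-dominated pattern of gradient-based training that the abstract refers to: a convolution in the forward pass, the corresponding weight gradient in the backward pass, and an SGD update. This is a generic NumPy illustration only; the layer shapes, loss, and learning rate are arbitrary placeholders and the code is not taken from the paper or the NTX hardware. In an NTX-style system, the inner fused multiply-accumulate reductions would presumably be the part offloaded to the near-memory co-processor's IEEE 754 data path, with general-purpose RISC-V cores orchestrating the surrounding loops.

```python
# Illustrative sketch of the MAC-heavy kernels in gradient-based training.
# All shapes and hyperparameters are placeholders, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """Valid 2D convolution of a single-channel input with one filter."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Inner fused multiply-accumulate reduction over the receptive
            # field; this is the pattern that dominates forward and backward.
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def conv2d_weight_grad(x, dy, kshape):
    """Gradient of the loss w.r.t. the filter: the same MAC pattern,
    accumulated over the output positions with different operands."""
    kh, kw = kshape
    dw = np.zeros(kshape)
    for i in range(dy.shape[0]):
        for j in range(dy.shape[1]):
            dw += dy[i, j] * x[i:i + kh, j:j + kw]
    return dw

# One SGD step on a toy single-filter layer with an L2 loss.
x = rng.standard_normal((8, 8))
w = rng.standard_normal((3, 3))
target = rng.standard_normal((6, 6))
lr = 1e-2

y = conv2d(x, w)                          # forward pass
dy = y - target                           # dL/dy for L = 0.5 * ||y - target||^2
dw = conv2d_weight_grad(x, dy, w.shape)   # backward pass (weight gradient)
w -= lr * dw                              # parameter update
print("loss:", 0.5 * np.sum((conv2d(x, w) - target) ** 2))
```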
