IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Accelerating Recurrent Neural Networks: A Memory-Efficient Approach



Abstract

Recurrent neural networks (RNNs) have achieved state-of-the-art performance on various sequence learning tasks due to their powerful sequence modeling capability. However, RNNs usually require a large number of parameters and have high computational complexity. Hence, it is quite challenging to implement complex RNNs on embedded devices with stringent memory and latency requirements. In this paper, we first present a novel hybrid compression method for a widely used RNN variant, long short-term memory (LSTM), to tackle these implementation challenges. By properly using circulant matrices, forward nonlinear function approximation, and efficient quantization schemes with a retrain-based training strategy, the proposed compression method reduces memory usage by more than 95% with negligible accuracy loss, as verified on language modeling and speech recognition tasks. An efficient, scalable parallel hardware architecture is then proposed for the compressed LSTM. With an innovative chessboard division method for matrix-vector multiplications, the parallelism of the proposed hardware architecture can be freely chosen under a given latency requirement. Specifically, for the circulant matrix-vector multiplications employed in the compressed LSTM, the circulant matrices are judiciously reorganized to fit the chessboard division and minimize the number of memory accesses required for the matrix multiplications. The proposed architecture is modeled in register transfer language (RTL) and synthesized in the TSMC 90-nm CMOS technology. With 518.5 kB of on-chip memory, we are able to process a 512×512 compressed LSTM in 1.71 μs, corresponding to 2.46 TOPS on the uncompressed one, at a cost of 30.77-mm² chip area. The implementation results demonstrate that the proposed design achieves high flexibility and area efficiency, satisfying the requirements of many real-time applications on embedded devices. It is worth mentioning that the memory-efficient approach to accelerating LSTMs developed in this paper is also applicable to other RNN variants.
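The circulant weight matrices mentioned in the abstract are the main source of the memory reduction: an n×n circulant block is fully determined by its first column and can be multiplied with a vector in O(n log n) time via the FFT. The sketch below is a minimal NumPy illustration of this property only, not the paper's hardware method; the function names, block size, and block-circulant layout are assumptions made for the example.

```python
import numpy as np

def circulant_matvec(c, x):
    # Multiply the n x n circulant matrix whose first column is c by x.
    # Only the length-n vector c is stored instead of the full n x n
    # matrix, which is where the memory saving comes from; the product
    # equals the circular convolution of c and x, computed via the FFT.
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(first_cols, x, block):
    # Multiply a block-circulant weight matrix by x. first_cols[i][j] is
    # the first column (length `block`) of the circulant block in block-row
    # i and block-column j, so an (m*block) x (n*block) matrix is stored
    # with only m*n*block parameters instead of m*n*block*block.
    m, n = len(first_cols), len(first_cols[0])
    y = np.zeros(m * block)
    for i in range(m):
        acc = np.zeros(block)
        for j in range(n):
            acc += circulant_matvec(first_cols[i][j], x[j * block:(j + 1) * block])
        y[i * block:(i + 1) * block] = acc
    return y

# Quick sanity check against an explicitly built circulant matrix.
n = 8
c, x = np.random.randn(n), np.random.randn(n)
C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])
assert np.allclose(C @ x, circulant_matvec(c, x))
```

Storing only first columns of circulant blocks is consistent with the large memory reduction the abstract reports, although the paper reaches the >95% figure by combining this with quantization and nonlinear function approximation.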
