International Symposium on Microarchitecture

Towards Memory Friendly Long-Short Term Memory Networks (LSTMs) on Mobile GPUs



Abstract

Intelligent Personal Assistants (IPAs) with natural language processing (NLP) capabilities are increasingly popular on today's mobile devices. Recurrent neural networks (RNNs), and in particular Long-Short Term Memory networks (LSTMs), are becoming the core machine learning technique in NLP-based IPAs. With the continuously improving performance of mobile GPUs, local processing has become a promising alternative to the cloud-centric computation of IPAs, which incurs large data transmissions and privacy issues. However, LSTMs exhibit a quite inefficient memory access pattern when executed on mobile GPUs, due to redundant data movement and limited off-chip bandwidth. In this study, we explore memory-friendly LSTMs on mobile GPUs by hierarchically reducing off-chip memory accesses. To address the redundant data movement, we propose inter-cell level optimizations that intelligently parallelize the originally sequentially executed LSTM cells (the basic units in RNNs, analogous to neurons in CNNs) to improve data locality across cells with negligible accuracy loss. To relieve the pressure on limited off-chip memory bandwidth, we propose intra-cell level optimizations that dynamically skip the loads and computations of rows in the weight matrices that contribute trivially to the outputs. We also introduce a lightweight module into the GPU architecture for runtime row skipping in the weight matrices. Moreover, our techniques are equipped with thresholds that provide a unique tuning space for performance-accuracy trade-offs guided directly by user preferences. The experimental results show that our optimizations achieve substantial improvements in both performance and power with user-imperceptible accuracy loss, and that they scale well with increasing input data set size. Our user study also shows that the resulting system delivers an excellent user experience.
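The intra-cell row-skipping idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the contribution estimate (precomputed row L1 norm times the largest input magnitude, a cheap upper bound on the row's output) and the function name are assumptions chosen for clarity, and in the proposed design the skip decision is made at runtime by the lightweight hardware module rather than in software.

```python
def gate_matvec_with_row_skipping(W, x, threshold):
    """Compute y = W @ x, skipping rows whose estimated contribution
    to the output falls below `threshold`.

    Estimate per row: (L1 norm of the row) * (max |x_j|), an upper
    bound on |y_i|.  Rows below the threshold are neither loaded nor
    multiplied; their outputs are treated as zero.  Raising the
    threshold trades accuracy for fewer loads and computations.
    """
    x_max = max(abs(v) for v in x)
    y = [0.0] * len(W)
    for i, row in enumerate(W):
        # In the hardware design the row norms would be precomputed
        # offline; they are recomputed here only to keep the sketch
        # self-contained.
        row_norm = sum(abs(w) for w in row)
        if row_norm * x_max < threshold:
            continue  # skip the load and the dot product entirely
        y[i] = sum(w * v for w, v in zip(row, x))
    return y
```

With `threshold = 0.0` the function degenerates to a full matrix-vector product, which is the accuracy end of the performance-accuracy tuning space the thresholds expose.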
