International Conference on Field-Programmable Technology

Scalable Low-Latency Persistent Neural Machine Translation on CPU Server with Multiple FPGAs



Abstract

We present a CPU server with multiple FPGAs that is purely software-programmable by a unified framework to enable flexible implementation of modern real-life complex AI that scales to large model size (100M+ parameters), while delivering real-time inference latency (~ms). Using multiple FPGAs, we scale by keeping a large model persistent in on-chip memories across FPGAs to avoid costly off-chip accesses. We study systems with 1 to 8 FPGAs for different devices: Intel® Arria® 10, Stratix® 10, and a research Stratix 10 with an AI chiplet. We present the first multi-FPGA evaluation of a complex NMT with bi-directional LSTMs, attention, and beam search. Our system scales well. Going from 1 to 8 FPGAs allows hosting ~8× larger model with only ~2× latency increase. A batch-1 inference for a 100M-parameter NMT on 8 Stratix 10 FPGAs takes only ~10 ms. This system offers 110× better latency than the only prior NMT work on FPGAs, which uses a high-end FPGA and stores the model off-chip.
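The scaling idea described above, keeping model weights persistent in the local memories of several devices, can be illustrated with a toy sketch (this is not the paper's implementation; the row-wise split and helper names are assumptions for illustration): a weight matrix is partitioned across N devices, each device keeps its shard resident and computes its partial matrix-vector product locally, and the results are concatenated.

```python
# Toy sketch (not the paper's implementation): partition a weight matrix
# row-wise across N "devices" so each keeps its shard resident in local
# ("on-chip") memory, then combine the per-device partial results.

def shard_rows(matrix, n_devices):
    """Split the rows of `matrix` into n_devices contiguous shards."""
    rows_per = (len(matrix) + n_devices - 1) // n_devices
    return [matrix[i * rows_per:(i + 1) * rows_per] for i in range(n_devices)]

def matvec(shard, x):
    """y_shard = W_shard @ x: the work a single device performs locally."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in shard]

W = [[1, 0], [0, 1], [2, 2], [3, 1]]  # 4x2 weight matrix
x = [1.0, 2.0]
shards = shard_rows(W, n_devices=2)   # each "device" holds 2 rows
y = [v for s in shards for v in matvec(s, x)]
print(y)  # identical to the full product W @ x
```

Doubling the number of devices doubles the total resident weight capacity, which is the mechanism behind hosting an ~8× larger model on 8 FPGAs.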
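Beam search, one of the NMT components evaluated in the paper, is a standard decoding algorithm; a minimal sketch of one expand-and-prune step follows (the toy vocabulary and fixed probabilities are assumptions for illustration, not the paper's model):

```python
import math

def beam_search_step(beams, log_probs_fn, beam_width):
    """Expand each partial hypothesis with every candidate token and
    keep only the `beam_width` highest-scoring hypotheses."""
    candidates = []
    for tokens, score in beams:
        for tok, logp in log_probs_fn(tokens):
            candidates.append((tokens + [tok], score + logp))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]

# Toy 3-token vocabulary with fixed next-token log-probabilities,
# standing in for the NMT decoder's softmax output.
def toy_log_probs(tokens):
    return [(0, math.log(0.5)), (1, math.log(0.3)), (2, math.log(0.2))]

beams = [([], 0.0)]           # one empty hypothesis, log-prob 0
for _ in range(2):            # decode two tokens
    beams = beam_search_step(beams, toy_log_probs, beam_width=2)
print(beams)
```

Each decode step is a batch of small matrix-vector products, which is why batch-1 latency benefits from weights being persistent on-chip.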


