International Conference on Field-Programmable Technology

Scalable Low-Latency Persistent Neural Machine Translation on CPU Server with Multiple FPGAs



Abstract

We present a CPU server with multiple FPGAs that is purely software-programmable by a unified framework to enable flexible implementation of modern real-life complex AI that scales to large model size (100M+ parameters), while delivering real-time inference latency (~ms). Using multiple FPGAs, we scale by keeping a large model persistent in on-chip memories across FPGAs to avoid costly off-chip accesses. We study systems with 1 to 8 FPGAs for different devices: Intel® Arria® 10, Stratix® 10, and a research Stratix 10 with an AI chiplet. We present the first multi-FPGA evaluation of a complex NMT with bi-directional LSTMs, attention, and beam search. Our system scales well. Going from 1 to 8 FPGAs allows hosting ~8× larger model with only ~2× latency increase. A batch-1 inference for a 100M-parameter NMT on 8 Stratix 10 FPGAs takes only ~10 ms. This system offers 110× better latency than the only prior NMT work on FPGAs, which uses a high-end FPGA and stores the model off-chip.
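The scaling idea described above, keeping model weights persistent in the local memories of several devices, can be illustrated with a toy sketch (this is not the paper's implementation; the row-wise split and helper names are assumptions for illustration): a weight matrix is partitioned across N devices, each device keeps its shard resident and computes its partial matrix-vector product locally, and the results are concatenated.

```python
# Toy sketch (not the paper's implementation): partition a weight matrix
# row-wise across N "devices" so each keeps its shard resident in local
# ("on-chip") memory, then combine the per-device partial results.

def shard_rows(matrix, n_devices):
    """Split the rows of `matrix` into n_devices contiguous shards."""
    rows_per = (len(matrix) + n_devices - 1) // n_devices
    return [matrix[i * rows_per:(i + 1) * rows_per] for i in range(n_devices)]

def matvec(shard, x):
    """y_shard = W_shard @ x: the work a single device performs locally."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in shard]

W = [[1, 0], [0, 1], [2, 2], [3, 1]]  # 4x2 weight matrix
x = [1.0, 2.0]
shards = shard_rows(W, n_devices=2)   # each "device" holds 2 rows
y = [v for s in shards for v in matvec(s, x)]
print(y)  # identical to the full product W @ x
```

Doubling the number of devices doubles the total resident weight capacity, which is the mechanism behind hosting an ~8× larger model on 8 FPGAs.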
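Beam search, one of the NMT components evaluated in the paper, is a standard decoding algorithm; a minimal sketch of one expand-and-prune step follows (the toy vocabulary and fixed probabilities are assumptions for illustration, not the paper's model):

```python
import math

def beam_search_step(beams, log_probs_fn, beam_width):
    """Expand each partial hypothesis with every candidate token and
    keep only the `beam_width` highest-scoring hypotheses."""
    candidates = []
    for tokens, score in beams:
        for tok, logp in log_probs_fn(tokens):
            candidates.append((tokens + [tok], score + logp))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]

# Toy 3-token vocabulary with fixed next-token log-probabilities,
# standing in for the NMT decoder's softmax output.
def toy_log_probs(tokens):
    return [(0, math.log(0.5)), (1, math.log(0.3)), (2, math.log(0.2))]

beams = [([], 0.0)]           # one empty hypothesis, log-prob 0
for _ in range(2):            # decode two tokens
    beams = beam_search_step(beams, toy_log_probs, beam_width=2)
print(beams)
```

Each decode step is a batch of small matrix-vector products, which is why batch-1 latency benefits from weights being persistent on-chip.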


