International Conference on Field Programmable Logic and Applications

Embracing Diversity: Enhanced DSP Blocks for Low-Precision Deep Learning on FPGAs



Abstract

The use of reduced precision in Deep Learning (DL) inference tasks has recently been shown to significantly improve accelerator performance and to greatly reduce both model memory footprint and the required external memory bandwidth. With appropriate network retuning, reduced-precision networks can achieve accuracy close to, or equal to, that of full-precision floating-point models. Given the wide spectrum of precisions used in DL inference, FPGAs' ability to create custom bit-width datapaths gives them an advantage over other acceleration platforms in this domain. However, the embedded DSP blocks in the latest Intel and Xilinx FPGAs do not natively support precisions below 18-bit and thus cannot efficiently pack low-precision multiplications, leaving the DSP blocks under-utilized. In this work, we present an enhanced DSP block that can efficiently pack 2× as many 9-bit and 4× as many 4-bit multiplications as the baseline Arria-10-like DSP block, at the cost of a 12% block area overhead, which amounts to only a 0.6% increase in total FPGA core area. We quantify the performance gains of using this enhanced DSP block in two state-of-the-art convolutional neural network accelerators on three different models: AlexNet, VGG-16, and ResNet-50. On average, the new DSP block improved the computational performance of the 8-bit and 4-bit accelerators by 1.32× and 1.6×, while reducing the utilized chip area by 15% and 30%, respectively.
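The packing gains described above rest on a well-known arithmetic trick: two narrow multiplications that share one operand can be carried out with a single wide multiplier by concatenating the two independent inputs with guard bits. The sketch below illustrates the idea in software for unsigned 8-bit operands; it is an assumption-laden illustration of the general technique, not the paper's actual DSP block design (which handles this in hardware, including signed operands and accumulation).

```python
def packed_dual_multiply(a: int, b: int, c: int):
    """Compute (a*c, b*c) for unsigned 8-bit a, b, c with ONE wide multiply.

    The trick behind low-precision packing: place `a` and `b` in disjoint
    bit fields of a single wide operand, multiply once by the shared
    operand `c`, then slice the two partial products back out.
    """
    assert 0 <= a < 256 and 0 <= b < 256 and 0 <= c < 256
    SHIFT = 18  # b*c fits in 16 bits; 2 extra guard bits keep fields apart
    packed = (a << SHIFT) | b      # one wide operand holding both inputs
    product = packed * c           # the single wide multiplication
    return product >> SHIFT, product & ((1 << SHIFT) - 1)
```

The trick only halves multiplier usage when one operand is shared between two products, which is the common case in convolution layers (one activation multiplied by several filter weights, or vice versa); supporting fully independent operand pairs, as the enhanced DSP block does, requires changes to the block's internal datapath rather than software packing alone.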

