IEEE International Symposium on High Performance Computer Architecture

Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework


Abstract

Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded systems, are extensively investigated. Due to the huge model sizes and computation amounts, model compression is a critical step in deploying DNN models on edge devices. This paper focuses on weight quantization, a hardware-friendly model compression approach that is complementary to weight pruning. Unlike existing methods that use the same quantization scheme for all weights, we propose the first solution that applies different quantization schemes to different rows of the weight matrix. It is motivated by (1) the observation that the weight distributions of different rows are not the same, and (2) the potential to achieve better utilization of heterogeneous FPGA hardware resources. To achieve this, we first propose a hardware-friendly quantization scheme named sum-of-power-of-2 (SP2), suitable for Gaussian-like weight distributions, in which the multiplication arithmetic can be replaced with logic shifters and adders, thereby enabling highly efficient implementations with FPGA LUT resources. In contrast, the existing fixed-point quantization is suitable for uniform-like weight distributions and can be implemented efficiently by DSPs. Then, to fully exploit both types of resources, we propose an FPGA-centric mixed-scheme quantization (MSQ) with an ensemble of the proposed SP2 and fixed-point schemes. Combining the two schemes can maintain, or even increase, accuracy due to better matching with weight distributions. For the FPGA implementations, we develop a parameterized architecture with heterogeneous Generalized Matrix Multiplication (GEMM) cores: one using LUTs for computations with SP2-quantized weights, the other utilizing DSPs for fixed-point-quantized weights. Given the partition ratio between the two schemes based on resource characterization, the MSQ quantization training algorithm derives an optimally quantized model for the FPGA implementation. We evaluate our FPGA-centric quantization framework across multiple application domains. With optimal SP2/fixed-point ratios on two FPGA devices, i.e., Zynq XC7Z020 and XC7Z045, we achieve performance improvements of 2.1× to 4.1× compared to solely exploiting DSPs for all multiplication operations. In addition, CNN implementations with the proposed MSQ scheme achieve higher accuracy and comparable hardware utilization efficiency compared to state-of-the-art designs.
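
As a rough illustration of the SP2 idea described in the abstract, the Python sketch below shows how a weight quantized to a sum of two powers of two turns a multiplication into two shifts and an add, the operation pattern that maps onto LUTs. The exponent range and quantization grid are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of sum-of-power-of-2 (SP2) quantization: each quantized
# weight magnitude is 2^a + 2^b, so x * w needs only shifts and an add.
# The exponent bound and grid below are assumptions for illustration.

def sp2_levels(max_exp=4):
    """Enumerate illustrative SP2 magnitudes 2^a + 2^b with a >= b."""
    levels = {0}
    for a in range(max_exp + 1):
        for b in range(a + 1):
            levels.add(2**a + 2**b)
    return sorted(levels)

def quantize_sp2(w, levels):
    """Map an integer weight to the nearest SP2 level, keeping its sign."""
    mag = min(levels, key=lambda lv: abs(abs(w) - lv))
    return mag if w >= 0 else -mag

def sp2_multiply(x, a, b):
    """x * (2^a + 2^b) using only shifts and an add -- the LUT-friendly path."""
    return (x << a) + (x << b)

if __name__ == "__main__":
    levels = sp2_levels()
    print(quantize_sp2(13, levels))   # 13 -> 12 = 2^3 + 2^2
    print(sp2_multiply(5, 3, 2))      # 5 * 12 = 40 + 20 = 60
```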
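The row-wise mixed scheme can be sketched in the same spirit: given a partition ratio, some rows of a weight matrix are quantized with SP2 (destined for the LUT-based GEMM core) and the rest with fixed-point (for the DSP-based core). In the sketch below, the ratio, bit width, and "first rows get SP2" selection rule are placeholder assumptions; per the abstract, the paper derives the assignment through its MSQ quantization training algorithm.

```python
# Illustrative row-wise mixed-scheme quantization (MSQ) sketch. The
# partition ratio, bit width, and row-selection rule are placeholders.
import numpy as np

# SP2 magnitudes 2^a + 2^b for 0 <= b <= a <= 3 (see the previous sketch).
SP2_LEVELS = np.array([0, 2, 3, 4, 5, 6, 8, 9, 10, 12, 16], dtype=float)

def quantize_fixed_point(row, bits=4):
    """Uniform fixed-point quantization of one row to `bits` bits."""
    scale = max(np.abs(row).max(), 1e-8) / (2**(bits - 1) - 1)
    return np.round(row / scale) * scale

def quantize_sp2_row(row):
    """Snap each entry of one row onto a scaled SP2 grid, keeping signs."""
    scale = max(np.abs(row).max(), 1e-8) / SP2_LEVELS.max()
    grid = SP2_LEVELS * scale
    idx = np.abs(np.abs(row)[:, None] - grid[None, :]).argmin(axis=1)
    return np.sign(row) * grid[idx]

def msq_quantize(W, sp2_ratio=0.5):
    """Quantize a weight matrix row by row under a given SP2/fixed-point ratio."""
    split = int(sp2_ratio * W.shape[0])
    return np.stack([quantize_sp2_row(r) if i < split else quantize_fixed_point(r)
                     for i, r in enumerate(W)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 8))
    print(msq_quantize(W, sp2_ratio=0.5))
```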
