The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware. We also co-design a training procedure to preserve end-to-end model accuracy post quantization. As a result, the proposed quantization scheme improves the tradeoff between accuracy and on-device latency. The improvements are significant even on MobileNets, a model family known for run-time efficiency, and are demonstrated in ImageNet classification and COCO detection on popular CPUs.
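To illustrate the integer-only arithmetic idea at a high level, the sketch below shows a common affine quantization mapping, r ≈ S · (q − Z), where real values are represented by 8-bit integers q together with a real scale S and an integer zero-point Z. This is a minimal illustrative sketch, not the paper's exact procedure; the function names and the per-tensor scale/zero-point choice are assumptions made for the example.

```python
import numpy as np

def quantize(r, scale, zero_point, qmin=0, qmax=255):
    """Hypothetical helper: map real values r to 8-bit integers via
    q = round(r / scale) + zero_point, clamped to [qmin, qmax]."""
    q = np.round(r / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Recover an approximation of the real values: r ≈ scale * (q - zero_point)."""
    return scale * (q.astype(np.int32) - zero_point)

# Example: quantize a small weight tensor and inspect the round-trip error.
w = np.array([-0.8, -0.1, 0.0, 0.4, 1.2], dtype=np.float32)
scale = (w.max() - w.min()) / 255.0          # assumed per-tensor scale
zero_point = int(round(-w.min() / scale))    # offset chosen so 0.0 maps to an exact integer
q = quantize(w, scale, zero_point)
print(q, dequantize(q, scale, zero_point))
```

Under a mapping like this, matrix multiplications in inference can be carried out on the integer values q, with the scales folded in afterward, which is what makes execution on integer-only hardware possible.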