Convolutional neural networks for computer vision and other applications have surpassed human-level accuracy on high-end systems. Demand is growing to run these applications with the same accuracy on small, mobile hardware, which has considerably smaller memory and power budgets. Prior work on model compression for inference on such edge devices sacrifices some accuracy to compress the models. We propose a novel model compression approach that shares the exponents of weights stored in IEEE floating-point format and requires no fine-tuning after compression. We demonstrate the technique on several trained models, achieving nearly 10% storage compression while requiring less than 1.5 times the original execution time.
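To make the idea concrete, the fragment below is a minimal sketch of what exponent sharing across a group of float32 weights can look like. It assumes a block-floating-point-style grouping, and the names `pack_group` and `unpack_group` are illustrative, not the paper's actual scheme or API: the exponent field of each IEEE 754 float32 is separated from the sign and mantissa, one exponent is kept for the whole group, and each weight retains only a small offset from it.

```python
import numpy as np

def pack_group(weights):
    """Share one exponent across a group of float32 weights.

    Hypothetical illustration: keep the group's maximum biased exponent
    once, and for each weight store only its sign bit, 23-bit mantissa,
    and the offset from the shared exponent.
    """
    bits = np.asarray(weights, dtype=np.float32).view(np.uint32)
    sign = bits >> 31                  # 1-bit signs
    exp = (bits >> 23) & 0xFF          # biased 8-bit exponents
    mant = bits & 0x7FFFFF             # 23-bit mantissas
    shared_exp = exp.max()             # one exponent for the whole group
    offset = shared_exp - exp          # per-weight distance to shared exponent
    return shared_exp, sign, offset, mant

def unpack_group(shared_exp, sign, offset, mant):
    """Reassemble the original float32 bit patterns exactly."""
    exp = (shared_exp - offset).astype(np.uint32)
    bits = (sign << 31) | (exp << 23) | mant
    return bits.view(np.float32)

w = np.array([0.15, -0.031, 0.27, 0.008], dtype=np.float32)
assert np.array_equal(unpack_group(*pack_group(w)), w)  # exact round trip
```

The round trip is exact because the bits are only rearranged, which is consistent with needing no fine-tuning; any actual storage savings would come from encoding the shared exponent once and the remaining per-weight fields more compactly than the original 32 bits.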