High-performance activity recognition models trained on video data are difficult to train and deploy efficiently. We measure efficiency in terms of performance, model size, and run-time, during both training and inference. Researchers have demonstrated that 3D convolutions capture space-time dynamics well [13]; the challenge is that 3D convolutions are computationally intensive. [8] propose the Temporal Shift Module (TSM) for training efficiency, and [5] proposes Deep Compression for inference efficiency. TSM is a simple yet effective way to obtain near-3D-convolution performance at 2D-convolution computational cost. We apply these efficiency techniques, via transfer learning, to a newly labeled activity recognition dataset. Our labeling strategy is designed to produce highly temporal activities. We benchmark against a 2D ResNet50 backbone trained on individual frames and a multilayer 3D CNN trained on short multi-frame videos. Our contributions are:

1. A new highly temporal activity recognition dataset based on EgoHands [1].
2. Results showing that a 3D backbone on videos outperforms a 2D backbone on individual frames.
3. With TSM, a 5x training-efficiency gain in run-time with negligible performance loss.
4. With quantization alone, a 10x inference-efficiency gain in model size with negligible performance loss.
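The core idea behind TSM [8] is to shift a small fraction of channels along the temporal dimension so that a plain 2D convolution can mix information across neighboring frames. The sketch below is illustrative only, not the authors' implementation: it assumes, for simplicity, a per-spatial-location feature tensor of shape [T, C] represented as nested Python lists, with zero padding at the temporal boundaries.

```python
def temporal_shift(features, shift_fraction=8):
    """Shift a fraction of channels along time, zero-padding the ends.

    features: list of T frames, each a list of C channel activations.
    1/shift_fraction of channels receive the previous frame's values
    (shift forward in time), another 1/shift_fraction receive the next
    frame's values (shift backward); the remaining channels are unchanged.
    """
    T = len(features)
    C = len(features[0])
    fold = C // shift_fraction  # number of channels moved per direction
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            if c < fold:
                # shifted forward: frame t sees frame t-1 (zero at t=0)
                out[t][c] = features[t - 1][c] if t > 0 else 0.0
            elif c < 2 * fold:
                # shifted backward: frame t sees frame t+1 (zero at t=T-1)
                out[t][c] = features[t + 1][c] if t < T - 1 else 0.0
            else:
                # untouched channels pass through
                out[t][c] = features[t][c]
    return out
```

Because the shift itself involves no multiplications, a 2D backbone augmented this way keeps essentially 2D-convolution compute cost while still mixing temporal context.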
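To make the inference-efficiency claim concrete, the sketch below shows symmetric int8 post-training quantization of a float weight vector. This is a minimal illustration under assumed details (per-tensor symmetric scaling, clamping to [-127, 127]), not the exact scheme behind the 10x figure; storing 32-bit floats as 8-bit integers alone accounts for a 4x size reduction before any further compression.

```python
def quantize_int8(weights):
    """Symmetric post-training quantization of float weights to int8.

    Returns (quantized integer list, scale). Recover approximate
    floats with q * scale. Scale is chosen so the largest-magnitude
    weight maps to +/-127.
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate float weights."""
    return [q * scale for q in quantized]
```

The accuracy cost of this rounding is what our "negligible performance loss" result measures empirically.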