First person action recognition is an active research area, driven by the increasing popularity of wearable devices. Action classification for first person video (FPV) is more challenging than conventional action classification due to strong egocentric motion, frequent viewpoint changes, and diverse global motion patterns. To tackle these challenges, we introduce a two-stream convolutional neural network that improves action recognition via long-term fusion pooling operators. The proposed method effectively captures the temporal structure of actions by pooling over sequences of frame-wise appearance and motion features. Our experiments validate the effect of the feature pooling operators and show that the proposed method achieves state-of-the-art performance on standard action datasets.
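As a rough illustration of the idea, the sketch below (an assumption, not the authors' implementation) shows one common family of long-term fusion pooling: frame-wise features from an appearance (RGB) stream and a motion (optical flow) stream are pooled over the temporal axis with max and average operators, then fused for classification. The class name, feature dimensions, and choice of pooling operators are all illustrative.

```python
# Minimal sketch of long-term fusion pooling over frame-wise two-stream
# features; assumes each stream has already produced a per-frame feature
# vector (e.g., from a 2D CNN backbone). Not the paper's exact operators.
import torch
import torch.nn as nn

class LongTermFusionPooling(nn.Module):
    """Pools sequences of frame-wise features into a clip-level descriptor."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # Two streams, each contributing max- and average-pooled features.
        self.classifier = nn.Linear(4 * feat_dim, num_classes)

    def forward(self, rgb_feats: torch.Tensor, flow_feats: torch.Tensor):
        # rgb_feats, flow_feats: (batch, time, feat_dim) frame-wise features.
        pooled = []
        for feats in (rgb_feats, flow_feats):
            pooled.append(feats.max(dim=1).values)  # temporal max pooling
            pooled.append(feats.mean(dim=1))        # temporal average pooling
        clip_descriptor = torch.cat(pooled, dim=-1)  # (batch, 4 * feat_dim)
        return self.classifier(clip_descriptor)

# Usage with stand-in features: 2 clips, 16 frames, 512-dim features.
model = LongTermFusionPooling(feat_dim=512, num_classes=10)
rgb = torch.randn(2, 16, 512)
flow = torch.randn(2, 16, 512)
logits = model(rgb, flow)  # shape: (2, 10)
```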