In this paper, we present an algorithm for action recognition that uses only depth maps. We propose a set of handcrafted features to describe a person's shape in noisy depth maps. We extract features with a convolutional neural network (CNN) trained on multi-channel input sequences consisting of two consecutive depth maps and a depth map projected onto an orthogonal Cartesian plane. We show experimentally that combining the CNN-extracted features with the proposed handcrafted features leads to better classification performance. We demonstrate that an LSTM trained on such aggregated features achieves state-of-the-art classification performance on the UTKinect dataset. We also propose a global statistical descriptor of temporal features and show experimentally that it has high discriminative power on time-series of CNN features concatenated with the handcrafted features.
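To make the idea of a global statistical descriptor of temporal features concrete, the following is a minimal sketch, not the paper's exact method: it assumes each frame yields a fixed-length feature vector (e.g., CNN features concatenated with handcrafted features) and aggregates the time-series with simple order-free statistics (mean, standard deviation, min, max) over the temporal axis.

```python
import numpy as np

def global_statistical_descriptor(frame_features):
    """Aggregate a (T, D) time-series of per-frame feature vectors into
    one fixed-length descriptor, independent of sequence length T.
    Hypothetical choice of statistics: mean, std, min, max per dimension."""
    f = np.asarray(frame_features, dtype=np.float64)  # shape (T, D)
    stats = [f.mean(axis=0), f.std(axis=0), f.min(axis=0), f.max(axis=0)]
    return np.concatenate(stats)  # shape (4 * D,)

# Usage: 10 frames of 5-dimensional features -> a 20-dimensional descriptor
desc = global_statistical_descriptor(np.random.rand(10, 5))
print(desc.shape)
```

Such a descriptor turns variable-length sequences into fixed-size vectors, so any standard classifier can be applied without recurrent modeling.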