In this paper, we propose VirtualActionNet, a strong two-stream point cloud sequence network for 3D human action recognition. In the data preprocessing stage, we convert each depth sequence into a point cloud sequence that serves as the input to VirtualActionNet. To encode intra-frame appearance structure, static point cloud techniques are first employed in a virtual action sequence generation module, which abstracts the point cloud sequence into a virtual action sequence. A two-stream network framework is then presented to model the virtual action sequence. Specifically, we design an appearance stream module that aggregates the appearance information preserved in each virtual action frame, and we introduce a motion stream module that captures dynamic changes along the temporal dimension. Finally, a joint loss strategy is adopted during training to improve the action prediction accuracy of the two-stream network. Extensive experiments on three publicly available datasets demonstrate the effectiveness of the proposed VirtualActionNet.
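To make the two-stream design with a joint loss concrete, the following is a minimal PyTorch-style sketch. The abstract does not specify the network's layers, feature dimensions, or loss weighting, so every module name, dimension, and the equal-weighted sum in `joint_loss` below are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamSketch(nn.Module):
    """Hypothetical two-stream model over per-frame virtual action features.
    Placeholder encoders stand in for the paper's unspecified modules."""
    def __init__(self, feat_dim=256, num_classes=60):
        super().__init__()
        # Appearance stream: encodes each frame's appearance features.
        self.appearance = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        # Motion stream: encodes frame-to-frame temporal differences.
        self.motion = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.app_head = nn.Linear(128, num_classes)
        self.mot_head = nn.Linear(128, num_classes)
        self.fused_head = nn.Linear(256, num_classes)

    def forward(self, frames):
        # frames: (batch, time, feat_dim) per-frame virtual action features.
        app = self.appearance(frames).mean(dim=1)      # temporal average pool
        diffs = frames[:, 1:] - frames[:, :-1]         # dynamic changes over time
        mot = self.motion(diffs).mean(dim=1)
        fused = torch.cat([app, mot], dim=-1)
        return self.app_head(app), self.mot_head(mot), self.fused_head(fused)

def joint_loss(logits_app, logits_mot, logits_fused, labels):
    # One common form of a joint loss: sum of per-stream and fused
    # classification losses (equal weights are an assumption here).
    return (F.cross_entropy(logits_app, labels)
            + F.cross_entropy(logits_mot, labels)
            + F.cross_entropy(logits_fused, labels))
```

Training against all three heads lets each stream receive its own supervision signal while the fused head couples them, which is one typical way a joint loss can raise the accuracy of a two-stream classifier.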