Temporal action detection is a challenging task that aims to detect various action instances in untrimmed videos. Existing detection approaches often fail to precisely localize the start and end times of action instances. To address this issue, we propose a novel Temporal Deconvolutional Pyramid Network (TDPN), in which a Temporal Deconvolution Fusion (TDF) module is developed at each pyramidal hierarchy to construct strong semantic features at multiple temporal scales for detecting action instances of various durations. In the TDF module, the temporal resolution of high-level features is expanded by a temporal deconvolution. The expanded high-level features are then fused with low-level features to form strong semantic features. The fused semantic features at multiple temporal scales are used to predict action categories and boundary offsets simultaneously, which significantly improves detection performance. In addition, a strict label assignment strategy is proposed for training to improve the precision of the temporal boundaries learned by the model. We evaluate our detection approach on two public datasets, THUMOS14 and MEXaction2. The experimental results demonstrate that our TDPN model achieves competitive performance on THUMOS14 and the best performance on MEXaction2 compared with other approaches.
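To make the TDF idea concrete, the following is a minimal NumPy sketch, not the paper's actual implementation: a 1-D temporal deconvolution (transposed convolution with kernel size 4, stride 2, padding 1, which exactly doubles the temporal length) upsamples a hypothetical high-level feature map, and the result is fused with a low-level feature map by element-wise addition. All shapes, the kernel configuration, and the additive fusion are illustrative assumptions.

```python
import numpy as np

def temporal_deconv(x, weight, stride=2, padding=1):
    """Transposed 1-D convolution over the temporal axis.

    x:      (C_in, T)          -- high-level feature map
    weight: (C_in, C_out, K)   -- deconvolution kernel
    With stride=2, K=4, padding=1 the output length is exactly 2*T.
    """
    c_in, t = x.shape
    _, c_out, k = weight.shape
    t_out = (t - 1) * stride + k
    y = np.zeros((c_out, t_out))
    for i in range(t):
        # Scatter each input time step into a K-wide window of the output
        # (the defining operation of a transposed convolution).
        y[:, i * stride : i * stride + k] += np.einsum('i,iok->ok', x[:, i], weight)
    # Crop the implicit padding from both ends.
    return y[:, padding : t_out - padding]

rng = np.random.default_rng(0)
high = rng.standard_normal((8, 16))        # high-level feature: 8 channels, 16 steps
low = rng.standard_normal((8, 32))         # low-level feature: 8 channels, 32 steps
w = rng.standard_normal((8, 8, 4)) * 0.1   # deconvolution weights (illustrative)

upsampled = temporal_deconv(high, w)       # temporal resolution doubled: (8, 32)
fused = upsampled + low                    # additive fusion (one possible strategy)
```

In a real network the fusion would typically be followed by a convolution and a nonlinearity, and the fused maps at every pyramid level would feed the classification and boundary-regression heads; the sketch only shows the resolution-matching step that makes the fusion possible.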