IEEE Winter Conference on Applications of Computer Vision

Video Action Recognition With an Additional End-to-End Trained Temporal Stream



Abstract

Detecting actions in videos requires understanding the temporal relationships among frames. Typical action recognition approaches rely on optical flow estimation methods to convey temporal information to a CNN. Recent studies employ 3D convolutions in addition to optical flow to process the temporal information. While these models achieve slightly better results than two-stream 2D convolutional approaches, they are significantly more complex and require more data and time to train. We propose an efficient, adaptive-batch-size distributed training algorithm with customized optimizations for training the two 2D streams. We introduce a new 2D convolutional temporal stream that is trained end-to-end with a neural network. The flexibility to freeze some of this temporal stream's layers during training opens the possibility of ensemble learning with more than one temporal stream. Our architecture combining three streams achieves the highest accuracies we know of on UCF101 and HMDB51 among systems that do not pretrain on much larger datasets (e.g., Kinetics). We achieve these results while keeping our spatial and temporal streams 4.67× faster to train than the 3D convolution approaches.
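The abstract describes combining a spatial stream with more than one temporal stream. A common way to combine such streams is late fusion: average the class-probability distributions each stream produces and predict the highest-scoring class. The sketch below illustrates that idea only; the stream names, logit values, and equal weighting are illustrative assumptions, not details from the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_streams(stream_logits):
    # Late fusion: average the per-stream class probabilities.
    # Equal weights here; a real system might weight streams differently.
    probs = [softmax(logits) for logits in stream_logits]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]

# Hypothetical per-class logits from three streams over 4 action classes:
# a spatial (RGB) stream and two temporal streams.
spatial       = [2.0, 0.5, 0.1, -1.0]
temporal_flow = [1.5, 2.5, 0.0, -0.5]
temporal_e2e  = [1.8, 2.0, 0.2, -0.8]

fused = fuse_streams([spatial, temporal_flow, temporal_e2e])
pred = max(range(len(fused)), key=fused.__getitem__)
```

Averaging probabilities rather than raw logits keeps each stream's contribution on a comparable scale regardless of how confident its logits are.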
