IEEE Winter Conference on Applications of Computer Vision

Video Action Recognition With an Additional End-to-End Trained Temporal Stream



Abstract

Detecting actions in videos requires understanding the temporal relationships among frames. Typical action recognition approaches rely on optical flow estimation methods to convey temporal information to a CNN. Recent studies employ 3D convolutions in addition to optical flow to process the temporal information. While these models achieve slightly better results than two-stream 2D convolutional approaches, they are significantly more complex and require more data and time to train. We propose an efficient, adaptive-batch-size distributed training algorithm with customized optimizations for training the two 2D streams. We introduce a new 2D convolutional temporal stream that is trained end-to-end with a neural network. The flexibility to freeze some of this temporal stream's layers during training opens the possibility of ensemble learning with more than one temporal stream. Our architecture combining three streams achieves the highest accuracies we know of on UCF101 and HMDB51 among systems that do not pretrain on much larger datasets (e.g., Kinetics). We achieve these results while keeping our spatial and temporal streams 4.67× faster to train than the 3D convolution approaches.
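The abstract describes combining a spatial stream with more than one temporal stream. A common way to combine such streams is late fusion: average the class-probability distributions each stream produces and predict the highest-scoring class. The sketch below illustrates that idea only; the stream names, logit values, and equal weighting are illustrative assumptions, not details from the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_streams(stream_logits):
    # Late fusion: average the per-stream class probabilities.
    # Equal weights here; a real system might weight streams differently.
    probs = [softmax(logits) for logits in stream_logits]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]

# Hypothetical per-class logits from three streams over 4 action classes:
# a spatial (RGB) stream and two temporal streams.
spatial       = [2.0, 0.5, 0.1, -1.0]
temporal_flow = [1.5, 2.5, 0.0, -0.5]
temporal_e2e  = [1.8, 2.0, 0.2, -0.8]

fused = fuse_streams([spatial, temporal_flow, temporal_e2e])
pred = max(range(len(fused)), key=fused.__getitem__)
```

Averaging probabilities rather than raw logits keeps each stream's contribution on a comparable scale regardless of how confident its logits are.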
