IEEE/CVF Conference on Computer Vision and Pattern Recognition

Interpretable Video Captioning via Trajectory Structured Localization


Abstract

Automatically describing open-domain videos with natural language is attracting increasing interest in the field of artificial intelligence. Most existing methods simply borrow ideas from image captioning and obtain a compact video representation from an ensemble of global image features before feeding it to an RNN decoder that outputs a sentence of variable length. However, given only a global video representation, it is not only arduous for the generator to focus on specific salient objects at different times, but even more formidable to capture the fine-grained motion information and the relations between moving instances needed for more subtle linguistic descriptions. In this paper, we propose a Trajectory Structured Attentional Encoder-Decoder (TSA-ED) neural network framework for more elaborate video captioning, which works by integrating local spatial-temporal representations at the trajectory level through a structured attention mechanism. Our proposed method builds on an LSTM-based encoder-decoder framework and incorporates an attention modeling scheme that adaptively learns the correlation between sentence structure and the moving objects in a video, and consequently generates more accurate and detailed descriptions in the decoding stage. Experimental results demonstrate that the feature representation and structured attention mechanism based on trajectory clusters efficiently capture the local motion information in a video, help generate finer-grained video descriptions, and achieve state-of-the-art performance on the well-known Charades and MSVD datasets.