IEEE/CVF Conference on Computer Vision and Pattern Recognition

Interpretable Video Captioning via Trajectory Structured Localization


Abstract

Automatically describing open-domain videos with natural language is attracting increasing interest in the field of artificial intelligence. Most existing methods simply borrow ideas from image captioning and obtain a compact video representation from an ensemble of global image features before feeding it to an RNN decoder that outputs a sentence of variable length. However, given only a global video representation, it is not only arduous for the generator to focus on specific salient objects at different times, but even more formidable to capture the fine-grained motion information and the relations between moving instances needed for more subtle linguistic descriptions. In this paper, we propose a Trajectory Structured Attentional Encoder-Decoder (TSA-ED) neural network framework for more elaborate video captioning, which works by integrating local spatial-temporal representations at the trajectory level through a structured attention mechanism. Our proposed method builds on an LSTM-based encoder-decoder framework and incorporates an attention modeling scheme that adaptively learns the correlation between sentence structure and the moving objects in a video, and consequently generates more accurate and detailed descriptions in the decoding stage. Experimental results demonstrate that the feature representation and structured attention mechanism based on trajectory clusters efficiently capture the local motion information in a video, help generate finer-grained video descriptions, and achieve state-of-the-art performance on the well-known Charades and MSVD datasets.