
Video Captioning With Attention-Based LSTM and Semantic Consistency



Abstract

Recent progress in using long short-term memory (LSTM) for image captioning has motivated the exploration of their applications for video captioning. By taking a video as a sequence of features, an LSTM model is trained on video-sentence pairs and learns to associate a video with a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without considering an attention mechanism that allows for selecting salient features. Furthermore, existing approaches usually model the translation error but ignore the correlations between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, to translate videos into natural sentences. This framework integrates an attention mechanism with LSTM to capture the salient structures of a video, and explores the correlation between multimodal representations (i.e., words and visual content) for generating sentences with rich semantic content. Specifically, we first propose an attention mechanism that uses a dynamic weighted sum of local two-dimensional convolutional neural network representations. Then, an LSTM decoder takes these visual features at time t and the word-embedding feature at time t−1 to generate important words. Finally, we use multimodal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistency of the sentence description and the video visual content. Experiments on the benchmark datasets demonstrate that our method using a single feature can achieve competitive or even better results than the state-of-the-art baselines for video captioning in both BLEU and METEOR.
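To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' released code): it combines a soft-attention weighted sum over per-frame 2D-CNN features, an LSTM decoder that takes the attended visual feature at time t together with the word embedding from time t−1, and a joint visual-sentence embedding standing in for the semantic-consistency term. All layer sizes, names, and the cosine-based consistency loss are illustrative assumptions.

```python
# Hypothetical sketch of attention-based LSTM captioning with a semantic-consistency
# term; dimensions, names, and loss forms are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512, joint_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # Attention scores depend on the previous hidden state and each frame feature.
        self.att_frame = nn.Linear(feat_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # Decoder LSTM consumes [attended visual feature ; previous word embedding].
        self.lstm = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)
        # Projections into a joint space for the semantic-consistency term.
        self.vis_proj = nn.Linear(feat_dim, joint_dim)
        self.sent_proj = nn.Linear(hidden_dim, joint_dim)

    def attend(self, frame_feats, h):
        # frame_feats: (B, T, feat_dim); h: (B, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_frame(frame_feats) + self.att_hidden(h).unsqueeze(1)))  # (B, T, 1)
        alpha = F.softmax(scores, dim=1)
        # Dynamic weighted sum of the local CNN representations.
        return (alpha * frame_feats).sum(dim=1)  # (B, feat_dim)

    def forward(self, frame_feats, captions):
        # captions: (B, L) word indices, used with teacher forcing for simplicity.
        B, L = captions.shape
        h = frame_feats.new_zeros(B, self.lstm.hidden_size)
        c = frame_feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(1, L):
            ctx = self.attend(frame_feats, h)           # attended visual feature at time t
            prev = self.word_embed(captions[:, t - 1])  # word embedding from time t-1
            h, c = self.lstm(torch.cat([ctx, prev], dim=1), (h, c))
            logits.append(self.out(h))
        logits = torch.stack(logits, dim=1)             # (B, L-1, vocab_size)

        # Semantic-consistency term: pull the mean visual feature and the final sentence
        # state together in a joint embedding space (a plain cosine loss is used here as
        # a stand-in for whatever embedding loss the paper actually optimizes).
        v = F.normalize(self.vis_proj(frame_feats.mean(dim=1)), dim=1)
        s = F.normalize(self.sent_proj(h), dim=1)
        consistency_loss = (1.0 - (v * s).sum(dim=1)).mean()

        caption_loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
        return caption_loss + consistency_loss
```

In use, frame_feats would hold the stacked 2D-CNN features for the sampled frames of one video and captions the tokenized ground-truth sentence; the two losses are simply summed here, whereas the actual framework may weight the captioning and consistency objectives differently.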

Bibliographic Information

  • Source
    IEEE transactions on multimedia | 2017, No. 9 | pp. 2045-2055 | 11 pages
  • Author Affiliations

    Center of Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China;

    Center of Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China;

    Department of Computer Science, Columbia University, New York, NY, USA;

    Center of Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China;

    Center of Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China;

  • Indexing Information
  • Original Format: PDF
  • Language: eng
  • Chinese Library Classification (CLC)
  • Keywords

    Visualization; Semantics; Two dimensional displays; Neural networks; Computational modeling; Feature extraction; Correlation;


