
Adaptively Converting Auxiliary Attributes and Textual Embedding for Video Captioning Based on BiLSTM


Abstract

Automatically generating captions for videos poses a huge challenge, since it is a cross-modal task involving both vision and text. Most existing models generate caption words merely from video visual content features, ignoring important underlying semantic information. The relationship between explicit semantics and hidden visual content is not holistically exploited, making it hard to describe fine-grained captions accurately from a global view. To better extract and integrate semantic information, we propose a novel encoder-decoder framework of bi-directional long short-term memory with an attention model and a conversion gate (BiLSTM-CG), which transfers auxiliary attributes and then generates detailed captions. Specifically, we extract semantic attributes from sliced frames in a multiple-instance learning (MIL) manner; MIL algorithms attempt to learn a classification function that can predict the labels of bags and/or instances in the visual content. In the encoding stage, we adopt 2D and 3D convolutional neural networks to encode video clips and then feed the concatenated features into a BiLSTM. In the decoding stage, we design a CG that adaptively fuses semantic attributes into the hidden features at the word level and converts auxiliary attributes and textual embeddings for video captioning. Furthermore, the CG can automatically decide the optimal time stamp at which to capture the explicit semantics or to rely on the hidden states of the language model to generate the next word. Extensive experiments conducted on the MSR-VTT and MSVD video captioning datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.
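To make the word-level adaptive fusion concrete, the following is a minimal sketch of a single conversion-gate (CG) decoding step, written in PyTorch. It is an illustration only, not the authors' implementation: the class name ConversionGateDecoder, the gating formula, and all dimensions are hypothetical assumptions based on the abstract's description.

import torch
import torch.nn as nn

class ConversionGateDecoder(nn.Module):
    """Illustrative CG-style decoding step: a sigmoid gate decides, per word,
    how much to rely on explicit semantic attributes versus the language
    model's hidden state."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, attr_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.attr_proj = nn.Linear(attr_dim, hidden_dim)           # map MIL attribute scores into hidden space
        self.gate = nn.Linear(hidden_dim + attr_dim, hidden_dim)   # the conversion gate itself
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, state, attrs):
        # prev_word: (B,) ids of previously generated words
        # state:     (h, c) LSTM state, each of shape (B, hidden_dim)
        # attrs:     (B, attr_dim) semantic attribute scores from a MIL detector
        h, c = self.lstm(self.embed(prev_word), state)
        g = torch.sigmoid(self.gate(torch.cat([h, attrs], dim=1)))     # gate in [0, 1] per hidden dimension
        fused = g * torch.tanh(self.attr_proj(attrs)) + (1.0 - g) * h  # adaptive word-level fusion
        return self.out(fused), (h, c)                                 # next-word logits and new state

# Tiny usage example with random inputs (batch size 2, vocabulary of 10,000 words).
decoder = ConversionGateDecoder(vocab_size=10000)
state = (torch.zeros(2, 512), torch.zeros(2, 512))
logits, state = decoder(torch.tensor([1, 5]), state, torch.rand(2, 300))

In this sketch, when the gate saturates near 1 the next word is driven by the explicit attributes, and near 0 it falls back to the language model's hidden state, which mirrors the adaptive behaviour the abstract attributes to the CG.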

Record details

  • Source
    Neural Processing Letters | 2020, Issue 3 | pp. 2353-2369 | 17 pages
  • Author affiliations

    School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, Hubei, China;

    School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, Hubei, China; Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology, Wuhan 430070, Hubei, China;

    School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, Hubei, China; Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology, Wuhan 430070, Hubei, China;

    School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, Hubei, China;

    School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, Hubei, China;

    School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, Hubei, China; Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology, Wuhan 430070, Hubei, China;

  • Indexing information
  • Original format: PDF
  • Language: eng
  • CLC classification
  • Keywords

    Video captioning; Bi-directional long short-term memory; Multiple-instance learning; Semantic fine-grained attributes; Attention mechanism; Conversion gate;


