Hierarchical & multimodal video captioning: Discovering and transferring multimodal knowledge for vision to language

Abstract

Recently, video captioning has achieved significant progress through advances in Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Given a video, a deep learning approach is applied to encode the visual information and generate the corresponding caption. However, this direct visual-to-textual translation ignores rich intermediate descriptions such as objects, scenes, and actions. In this paper, we propose to discover and integrate rich, primeval external knowledge (i.e., frame-based image captions) to benefit the video captioning task. We propose a Hierarchical & Multimodal Video Caption (HMVC) model that jointly learns the dynamics within both the visual and textual modalities, inferring a sentence of arbitrary length from an input video with an arbitrary number of frames. Specifically, we argue that the latent semantic discovery module transfers external knowledge to generate complex and helpful complementary cues. We comprehensively evaluate the HMVC model on the Microsoft Video Description Corpus (MSVD), the MPII Movie Description Dataset (MPII-MD), and the novel dataset for the 2016 MSR Video to Text challenge (MSR-VTT), attaining competitive performance. In addition, we evaluate the generalization properties of the proposed model by fine-tuning and evaluating it on different datasets. To the best of our knowledge, this is the first time such an analysis has been applied to the video captioning task.
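The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the kind of encoder-decoder fusion it describes: per-frame CNN features and text features derived from frame-based image captions are fused, encoded by an LSTM, and decoded into a variable-length sentence. All names and dimensions here (HMVCSketch, vis_dim, txt_dim, hid_dim, vocab_size) are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class HMVCSketch(nn.Module):
    """Sketch of multimodal video captioning: fuse visual and textual
    frame features, encode the sequence, decode a caption."""

    def __init__(self, vis_dim=2048, txt_dim=300, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)  # per-frame CNN features
        self.txt_proj = nn.Linear(txt_dim, hid_dim)  # per-frame caption features
        self.encoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, vis_feats, txt_feats, captions):
        # vis_feats: (B, T, vis_dim); txt_feats: (B, T, txt_dim); any T works.
        fused = torch.tanh(self.vis_proj(vis_feats) + self.txt_proj(txt_feats))
        _, (h, c) = self.encoder(fused)        # video-level summary state
        emb = self.embed(captions)             # (B, L, hid_dim); any L works.
        dec_out, _ = self.decoder(emb, (h, c))
        return self.out(dec_out)               # (B, L, vocab_size) logits

# Toy usage: 2 videos, 8 frames each, 5-token target captions.
model = HMVCSketch()
logits = model(torch.randn(2, 8, 2048), torch.randn(2, 8, 300),
               torch.randint(0, 10000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 10000])
```

Because both the encoder and decoder are recurrent, the sketch naturally accepts an arbitrary number of frames and emits an arbitrary-length sentence, matching the property the abstract claims for HMVC.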

Bibliographic details

  • Source
    Computer Vision and Image Understanding | 2017, No. 10 | pp. 113-125 | 13 pages
  • Author affiliations

    School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China;

    School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China;

    Smart Systems Institute, National University of Singapore, Singapore;

    NUS Graduate School for Integrative Sciences and Engineering National University of Singapore, Singapore;

    School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China;

    School of Computing, National University of Singapore, Singapore;

  • Indexing
  • Format: PDF
  • Language: eng
  • CLC classification
  • Keywords

    Video to text; Semantic discovery; Multi-modal fusion; Deep learning;
