IEEE Conference on Computer Vision and Pattern Recognition

Jointly Modeling Embedding and Translation to Bridge Video and Language



Abstract

Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate each word locally from the previous words and the visual content, so the relationship between the semantics of the whole sentence and the visual content is not holistically exploited. As a result, the generated sentences may be contextually correct while the semantics (e.g., subjects, verbs or objects) are wrong. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously learns an LSTM and a visual-semantic embedding. The former locally maximizes the probability of generating the next word given the previous words and the visual content, while the latter creates a visual-semantic embedding space that enforces the relationship between the semantics of the entire sentence and the visual content. Experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the best published performance to date in generating natural sentences: 45.3% BLEU@4 and 31.0% METEOR. Superior performance is also reported on two movie description datasets (M-VAD and MPII-MD). In addition, we demonstrate that LSTM-E outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
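The abstract implies a two-term training objective: a relevance term that pulls the video and sentence representations together in the shared embedding space, and a coherence term that is the usual next-word negative log-likelihood. Below is a minimal PyTorch sketch of such a joint loss, for illustration only; the module names, feature dimensions, mean-pooled sentence encoding, and the trade-off weight `lam` are assumptions, not the authors' released implementation.

```python
# A minimal sketch of an LSTM-E style joint objective, assuming PyTorch.
# Dimensions, pooling choice, and lam are illustrative assumptions.
import torch
import torch.nn as nn

class LSTME(nn.Module):
    def __init__(self, vocab_size, video_dim=4096, word_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video -> embedding space
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.sent_proj = nn.Linear(word_dim, embed_dim)    # sentence -> embedding space
        self.lstm = nn.LSTM(word_dim + embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, video_feat, captions):
        # video_feat: (batch, video_dim); captions: (batch, seq_len) word ids
        v = self.video_proj(video_feat)                    # (batch, embed_dim)
        words = self.word_embed(captions)                  # (batch, seq, word_dim)

        # Relevance term: squared distance between the video embedding and a
        # mean-pooled sentence embedding in the shared space.
        s = self.sent_proj(words.mean(dim=1))              # (batch, embed_dim)
        relevance = ((v - s) ** 2).sum(dim=1).mean()

        # Coherence term: next-word negative log-likelihood, with the video
        # embedding fed to the LSTM at every time step.
        v_rep = v.unsqueeze(1).expand(-1, words.size(1), -1)
        h, _ = self.lstm(torch.cat([words, v_rep], dim=2))
        logits = self.out(h[:, :-1])                       # predict words 1..T-1
        coherence = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            captions[:, 1:].reshape(-1))

        lam = 0.7  # relevance/coherence trade-off; an assumed value
        return (1 - lam) * relevance + lam * coherence
```

In this reading, a larger `lam` favors fluent sentences while a smaller one favors semantic agreement between the sentence and the video; the two terms share the embedding layers, which is what couples sentence-level semantics to word-by-word generation.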
