IEEE Conference on Computer Vision and Pattern Recognition

Jointly Modeling Embedding and Translation to Bridge Video and Language



Abstract

Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate each word locally from the previous words and the visual content, so the relationship between the semantics of the whole sentence and the visual content is not holistically exploited. As a result, the generated sentences may be contextually correct while the semantics (e.g., subjects, verbs or objects) are wrong. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously learns an LSTM and a visual-semantic embedding. The former locally maximizes the probability of generating the next word given the previous words and the visual content, while the latter creates a visual-semantic embedding space that enforces the relationship between the semantics of the entire sentence and the visual content. Experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the best published performance to date in generating natural sentences: 45.3% BLEU@4 and 31.0% METEOR. Superior performance is also reported on two movie description datasets (M-VAD and MPII-MD). In addition, we demonstrate that LSTM-E outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
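The abstract implies a two-term training objective: a relevance term that pulls the video and sentence representations together in the shared embedding space, and a coherence term that is the usual next-word negative log-likelihood. Below is a minimal PyTorch sketch of such a joint loss, for illustration only; the module names, feature dimensions, mean-pooled sentence encoding, and the trade-off weight `lam` are assumptions, not the authors' released implementation.

```python
# A minimal sketch of an LSTM-E style joint objective, assuming PyTorch.
# Dimensions, pooling choice, and lam are illustrative assumptions.
import torch
import torch.nn as nn

class LSTME(nn.Module):
    def __init__(self, vocab_size, video_dim=4096, word_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video -> embedding space
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.sent_proj = nn.Linear(word_dim, embed_dim)    # sentence -> embedding space
        self.lstm = nn.LSTM(word_dim + embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, video_feat, captions):
        # video_feat: (batch, video_dim); captions: (batch, seq_len) word ids
        v = self.video_proj(video_feat)                    # (batch, embed_dim)
        words = self.word_embed(captions)                  # (batch, seq, word_dim)

        # Relevance term: squared distance between the video embedding and a
        # mean-pooled sentence embedding in the shared space.
        s = self.sent_proj(words.mean(dim=1))              # (batch, embed_dim)
        relevance = ((v - s) ** 2).sum(dim=1).mean()

        # Coherence term: next-word negative log-likelihood, with the video
        # embedding fed to the LSTM at every time step.
        v_rep = v.unsqueeze(1).expand(-1, words.size(1), -1)
        h, _ = self.lstm(torch.cat([words, v_rep], dim=2))
        logits = self.out(h[:, :-1])                       # predict words 1..T-1
        coherence = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            captions[:, 1:].reshape(-1))

        lam = 0.7  # relevance/coherence trade-off; an assumed value
        return (1 - lam) * relevance + lam * coherence
```

In this reading, a larger `lam` favors fluent sentences while a smaller one favors semantic agreement between the sentence and the video; the two terms share the embedding layers, which is what couples sentence-level semantics to word-by-word generation.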
