IEEE Conference on Computer Vision and Pattern Recognition

Jointly Modeling Embedding and Translation to Bridge Video and Language



Abstract

Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate each word locally from the given previous words and the visual content, while the relationship between the semantics of the entire sentence and the visual content is not holistically exploited. As a result, the generated sentences may be contextually correct, but their semantics (e.g., subjects, verbs, or objects) may not be true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores LSTM learning and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and the visual content, while the latter creates a visual-semantic embedding space that enforces the relationship between the semantics of the entire sentence and the visual content. Experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the best published performance to date in generating natural sentences: 45.3% BLEU@4 and 31.0% METEOR. Superior performance is also reported on two movie description datasets (M-VAD and MPII-MD). In addition, we demonstrate that LSTM-E outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
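The abstract describes two complementary objectives: a coherence term (the LSTM's local next-word likelihood) and a relevance term (closeness of video and sentence representations in a shared embedding space). As a rough illustration of how such a joint objective can be combined, here is a minimal sketch; the function names, the squared-distance form of the relevance term, and the trade-off weight `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import math

def relevance_loss(video_emb, sentence_emb):
    # Squared Euclidean distance between the video and sentence
    # vectors in the shared visual-semantic embedding space.
    return sum((a - b) ** 2 for a, b in zip(video_emb, sentence_emb))

def coherence_loss(word_probs):
    # Negative log-likelihood of the ground-truth words under the
    # LSTM's next-word distributions (one probability per time step).
    return -sum(math.log(p) for p in word_probs)

def joint_loss(video_emb, sentence_emb, word_probs, lam=0.7):
    # Weighted combination of the two terms; lam is a hypothetical
    # trade-off hyperparameter between relevance and coherence.
    return (1 - lam) * relevance_loss(video_emb, sentence_emb) \
           + lam * coherence_loss(word_probs)
```

Minimizing the relevance term pulls a sentence's embedding toward its video's embedding (holistic semantics), while minimizing the coherence term keeps the word-by-word generation fluent; training on both jointly is what distinguishes this style of model from purely local decoders.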
