International Conference on Computer Vision

Language Features Matter: Effective Language Representations for Vision-Language Tasks



Abstract

Shouldn't language and vision features be treated equally in vision-language (VL) tasks? Many VL approaches treat the language component as an afterthought, using simple language models that are either built upon fixed word embeddings trained on text-only data or learned from scratch. We believe that language features deserve more attention, and conduct experiments comparing different word embeddings, language models, and embedding augmentation steps on five common VL tasks: image-sentence retrieval, image captioning, visual question answering, phrase grounding, and text-to-clip retrieval. Our experiments provide some striking results: an average embedding language model outperforms an LSTM on retrieval-style tasks, while state-of-the-art representations such as BERT perform relatively poorly on vision-language tasks. From this comprehensive set of experiments, we propose a set of best practices for incorporating the language component of VL tasks. To further elevate language features, we also show that knowledge in vision-language problems can be transferred across tasks to gain performance with multi-task training. This multi-task training is applied to a new Graph Oriented Vision-Language Embedding (GrOVLE), which we adapt from Word2Vec using WordNet and an original visual-language graph built from Visual Genome, providing a ready-to-use vision-language embedding: http://ai.bu.edu/grovle.
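The "average embedding language model" highlighted in the abstract can be sketched as mean-pooling: each word is mapped to a fixed vector, and the sentence representation is the element-wise mean of those vectors. A minimal illustration, using a toy 3-dimensional vocabulary that is purely hypothetical (not taken from GrOVLE or the paper):

```python
# Toy word-embedding table; the words and 3-d vectors are illustrative only.
embeddings = {
    "a":    [0.1, 0.0, 0.2],
    "dog":  [0.9, 0.4, 0.1],
    "runs": [0.2, 0.8, 0.5],
}

def average_embedding(sentence, table):
    """Mean-pool the word vectors of in-vocabulary tokens.

    Out-of-vocabulary words are skipped; an all-OOV sentence
    falls back to a zero vector of the embedding dimension.
    """
    dim = len(next(iter(table.values())))
    vecs = [table[w] for w in sentence.lower().split() if w in table]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

sentence_vec = average_embedding("a dog runs", embeddings)
print(sentence_vec)
```

Unlike an LSTM, this representation ignores word order entirely, which makes the retrieval-task result above notable: a much simpler, order-free model can match or beat a recurrent one.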
