International Conference on Computer Vision

Language Features Matter: Effective Language Representations for Vision-Language Tasks



Abstract

Shouldn't language and vision features be treated equally in vision-language (VL) tasks? Many VL approaches treat the language component as an afterthought, using simple language models that are either built upon fixed word embeddings trained on text-only data or learned from scratch. We believe that language features deserve more attention, and conduct experiments comparing different word embeddings, language models, and embedding augmentation steps on five common VL tasks: image-sentence retrieval, image captioning, visual question answering, phrase grounding, and text-to-clip retrieval. Our experiments provide some striking results: an average embedding language model outperforms an LSTM on retrieval-style tasks, while state-of-the-art representations such as BERT perform relatively poorly on vision-language tasks. From this comprehensive set of experiments, we propose a set of best practices for incorporating the language component of VL tasks. To further elevate language features, we also show that knowledge in vision-language problems can be transferred across tasks to gain performance with multi-task training. This multi-task training is applied to a new Graph Oriented Vision-Language Embedding (GrOVLE), which we adapt from Word2Vec using WordNet and an original visual-language graph built from Visual Genome, providing a ready-to-use vision-language embedding: http://ai.bu.edu/grovle.
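The "average embedding language model" highlighted in the abstract can be sketched as mean-pooling: each word is mapped to a fixed vector, and the sentence representation is the element-wise mean of those vectors. A minimal illustration, using a toy 3-dimensional vocabulary that is purely hypothetical (not taken from GrOVLE or the paper):

```python
# Toy word-embedding table; the words and 3-d vectors are illustrative only.
embeddings = {
    "a":    [0.1, 0.0, 0.2],
    "dog":  [0.9, 0.4, 0.1],
    "runs": [0.2, 0.8, 0.5],
}

def average_embedding(sentence, table):
    """Mean-pool the word vectors of in-vocabulary tokens.

    Out-of-vocabulary words are skipped; an all-OOV sentence
    falls back to a zero vector of the embedding dimension.
    """
    dim = len(next(iter(table.values())))
    vecs = [table[w] for w in sentence.lower().split() if w in table]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

sentence_vec = average_embedding("a dog runs", embeddings)
print(sentence_vec)
```

Unlike an LSTM, this representation ignores word order entirely, which makes the retrieval-task result above notable: a much simpler, order-free model can match or beat a recurrent one.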
