International Natural Language Generation Conference

What Goes Into A Word: Generating Image Descriptions With Top-Down Spatial Knowledge



Abstract

Generating grounded image descriptions requires associating linguistic units with their corresponding visual clues. A common method is to train a decoder language model with an attention mechanism over convolutional visual features. Attention weights align the stratified visual features, arranged by their location, with tokens (most commonly words) in the target description. However, words expressing spatial relations (e.g. "next to" and "under") do not refer directly to geometric arrangements of pixels but to complex geometric and conceptual representations. The aim of this paper is to evaluate which representations facilitate generating image descriptions with spatial relations and lead to better-grounded language generation. In particular, we investigate the contribution of four different representational modalities in generating relational referring expressions: (i) (pre-trained) convolutional visual features, (ii) spatial attention over visual features, (iii) top-down geometric relational knowledge between objects, and (iv) world knowledge captured by contextual embeddings in language models.
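
To make the common method concrete, below is a minimal sketch of additive (Bahdanau-style) spatial attention over a flattened grid of convolutional features at one decoding step. It is written in PyTorch; the module name, dimensions, and layer choices are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Additive attention over a grid of convolutional visual features;
    # returns a context vector and the per-region attention weights.
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)      # project visual features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                 # scalar score per region

    def forward(self, features, hidden):
        # features: (batch, regions, feat_dim), e.g. a 7x7 CNN grid -> 49 regions
        # hidden:   (batch, hidden_dim), decoder state before emitting a word
        scores = self.score(torch.tanh(
            self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                      # (batch, regions)
        alpha = torch.softmax(scores, dim=1)                # weights over regions
        context = (alpha.unsqueeze(-1) * features).sum(1)   # weighted feature sum
        return context, alpha

attn = SpatialAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
feats = torch.randn(4, 49, 2048)  # 4 images, 49 spatial regions
h = torch.randn(4, 512)           # decoder hidden state at the current step
context, alpha = attn(feats, h)   # alpha grounds the next word in image regions

The weights alpha are what aligns each generated token with image locations; the abstract's point is that for relation words this pixel-level alignment alone may be insufficient.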
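
Modality (iii), top-down geometric relational knowledge, is often encoded as pairwise features computed from object bounding boxes. The sketch below shows one common such encoding (normalized center offsets and log size ratios); the function name and the specific feature set are assumptions for illustration, not the paper's exact formulation.

import torch

def box_relation_features(boxes):
    # boxes: (N, 4) as (x, y, w, h) in image coordinates.
    # Returns (N, N, 4) pairwise geometric features:
    # scaled center offsets plus log width/height ratios.
    x, y, w, h = boxes.unbind(-1)
    cx, cy = x + w / 2, y + h / 2                              # box centers
    dx = (cx.unsqueeze(1) - cx.unsqueeze(0)) / w.unsqueeze(0)  # x-offset, scaled
    dy = (cy.unsqueeze(1) - cy.unsqueeze(0)) / h.unsqueeze(0)  # y-offset, scaled
    dw = torch.log(w.unsqueeze(1) / w.unsqueeze(0))            # log width ratio
    dh = torch.log(h.unsqueeze(1) / h.unsqueeze(0))            # log height ratio
    return torch.stack([dx, dy, dw, dh], dim=-1)

boxes = torch.tensor([[60., 40., 40., 40.],    # e.g. a cup
                      [20., 90., 140., 30.]])  # e.g. a table under it
rel = box_relation_features(boxes)  # rel[0, 1] encodes the cup relative to the table

Such explicit geometric features give the generator direct access to the spatial configurations that relation words like "under" describe, rather than leaving them implicit in the pixel grid.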
