Expressing Visual Relationships via Language

Abstract

Describing images with text is a fundamental problem in vision-language research. Current studies in this domain mostly focus on single-image captioning. However, in various real applications (e.g., image editing, difference interpretation, and retrieval), generating relational captions for two images can also be very useful. This important problem has remained largely unexplored, mostly due to a lack of datasets and effective models. To push forward the research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions. We then propose a new relational speaker model based on an encoder-decoder architecture with static relational attention and sequential multi-head attention. We also extend the model with dynamic relational attention, which calculates visual alignment while decoding. Our models are evaluated on our newly collected dataset and two public datasets consisting of image pairs annotated with relationship sentences. Experimental results, based on both automatic and human evaluation, demonstrate that our model outperforms all baselines and existing methods on all the datasets.
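To make the idea of attention between two images concrete, the following is a minimal sketch of cross-image attention over regional features. It is not the relational speaker model itself: the abstract does not specify the scoring function or feature shapes, so the scaled dot-product scoring, the 49-region by 16-dimension feature maps, and the names `softmax` and `cross_image_attention` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_image_attention(src_feats, trg_feats):
    """Attend from each image's regions to the other image's regions.

    src_feats: (N_src, d) regional features of the first image (assumed shape).
    trg_feats: (N_trg, d) regional features of the second image (assumed shape).
    Returns (src_to_trg, trg_to_src) with shapes (N_src, d) and (N_trg, d).
    """
    d = src_feats.shape[-1]
    # Pairwise alignment scores between every source and target region.
    # Scaled dot-product scoring is an assumption, not taken from the paper.
    scores = src_feats @ trg_feats.T / np.sqrt(d)        # (N_src, N_trg)
    # Each source region summarizes the target image, and vice versa.
    src_to_trg = softmax(scores, axis=1) @ trg_feats     # (N_src, d)
    trg_to_src = softmax(scores.T, axis=1) @ src_feats   # (N_trg, d)
    return src_to_trg, trg_to_src

# Hypothetical inputs: two 7x7 feature maps flattened to 49 regions each.
rng = np.random.default_rng(0)
src = rng.normal(size=(49, 16))
trg = rng.normal(size=(49, 16))
s2t, t2s = cross_image_attention(src, trg)
```

In this sketch the alignment is computed once from the two feature maps, before any decoding happens. A variant that recomputes the alignment at every decoding step, conditioned on the decoder state, would be closer in spirit to the dynamic attention the abstract mentions, though the details are again not given here.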
