IEEE Transactions on Pattern Analysis and Machine Intelligence

Deep Visual-Semantic Alignments for Generating Image Descriptions


Abstract

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks (RNN) over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions outperform retrieval baselines on both full images and on a new dataset of region-level annotations. Finally, we conduct large-scale analysis of our RNN language model on the Visual Genome dataset of 4.1 million captions and highlight the differences between image and region-level caption statistics.
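The abstract's alignment objective pairs each word with its most compatible image region and trains with a structured max-margin loss over a batch of image-sentence pairs. The sketch below is a minimal NumPy illustration of that idea, assuming precomputed region and word embeddings in a shared space and a plain dot-product similarity; the function names and shapes are illustrative, not the authors' exact implementation.

```python
import numpy as np

def alignment_score(region_embs, word_embs):
    """Image-sentence score: each word is matched to its best-scoring
    region and the (non-negative) matches are summed.

    region_embs: (n_regions, d) embeddings of image regions
    word_embs:   (n_words, d) embeddings of sentence words
    """
    sims = region_embs @ word_embs.T           # (n_regions, n_words) similarities
    # best region per word, thresholded at 0 as in the alignment objective
    return float(np.maximum(sims, 0.0).max(axis=0).sum())

def ranking_loss(S, margin=1.0):
    """Structured max-margin objective over a batch score matrix S[k, l]
    (image k vs. sentence l), with correct pairs on the diagonal.

    Penalizes any mismatched pair that scores within `margin` of the
    correct pair, in both retrieval directions.
    """
    diag = np.diag(S)
    cost_s = np.maximum(0.0, margin + S - diag[:, None])   # rank sentences given image
    cost_im = np.maximum(0.0, margin + S - diag[None, :])  # rank images given sentence
    np.fill_diagonal(cost_s, 0.0)                          # correct pairs incur no cost
    np.fill_diagonal(cost_im, 0.0)
    return float(cost_s.sum() + cost_im.sum())
```

With well-separated embeddings the loss is zero: if every correct pair outscores all mismatched pairs by at least the margin, both hinge terms vanish, which is what drives the two modalities into a common embedding during training.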
