IEEE Transactions on Pattern Analysis and Machine Intelligence

Aligning Where to See and What to Tell: Image Captioning with Region-Based Attention and Scene-Specific Contexts

Abstract

Recent progress on the automatic generation of image captions has shown that it is possible to describe the most salient information conveyed by an image with accurate and meaningful sentences. In this paper, we propose an image captioning system that exploits the parallel structures between images and sentences. In our model, the process of generating the next word, given the previously generated ones, is aligned with the visual perception experience, where attention shifts among visual regions; such transitions impose a thread of ordering on visual perception. This alignment characterizes the flow of latent meaning, which encodes what is semantically shared by the visual scene and the text description. Our system makes a further novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image; these contexts adapt the language model for word generation to specific scene types. We benchmark our system against published results on several popular datasets, using both automatic evaluation metrics and human evaluation. We show that adding either region-based attention or scene-specific contexts improves over systems that lack those components, and that combining the two attains state-of-the-art performance.
