Neural Processing Letters

Image Captioning Using Region-Based Attention Joint with Time-Varying Attention

Abstract

In this work, we propose a novel region-based and time-varying attention network (RTAN) model for image captioning, which determines where and when to attend to an image. The RTAN consists of a region-based attention network (RAN) and a time-varying attention network (TAN). In the RAN, we integrate a region proposal network with a soft attention mechanism, so that the model can locate the precise positions of objects in an image and focus on the object most relevant to the next word. In the TAN, we design a time-varying gate that determines whether visual information is needed to generate the next word. For example, when the next word is a non-visual word such as "the" or "to", the model predicts it based more on semantic information than on visual information. Compared with existing methods, the proposed RTAN model has two advantages: (1) it extracts more discriminative visual information; (2) it attends only to semantic information when predicting non-visual words. The effectiveness of RTAN is verified on the MSCOCO and Flickr30k datasets.
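The abstract gives enough architectural detail to sketch the two mechanisms in code. The PyTorch sketch below is illustrative only, not the authors' implementation: `RegionAttention` approximates the RAN's soft attention over region-proposal features, and `TimeVaryingGate` approximates the TAN's scalar gate that mixes visual and semantic context. All class names, layer shapes, and the form of the semantic context are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttention(nn.Module):
    """Soft attention over region-proposal features (hypothetical RAN sketch)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (B, R, feat_dim) features of R region proposals
        # hidden:  (B, hidden_dim) decoder state at the current time step
        e = self.score(torch.tanh(
            self.feat_proj(regions) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                # (B, R) attention scores
        alpha = F.softmax(e, dim=-1)                  # weights over regions
        context = (alpha.unsqueeze(-1) * regions).sum(dim=1)  # (B, feat_dim)
        return context, alpha

class TimeVaryingGate(nn.Module):
    """Scalar gate deciding how much visual context reaches the word predictor
    (hypothetical TAN sketch); visual and semantic contexts share a dimension."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, hidden, visual_ctx, semantic_ctx):
        beta = torch.sigmoid(self.gate(hidden))  # (B, 1); near 0 => rely on semantics
        return beta * visual_ctx + (1 - beta) * semantic_ctx

# Toy usage with assumed sizes: 36 regions of dim 2048, decoder hidden of dim 512.
B, R, D, H = 2, 36, 2048, 512
ran = RegionAttention(feat_dim=D, hidden_dim=H, attn_dim=512)
gate = TimeVaryingGate(hidden_dim=H)
regions = torch.randn(B, R, D)
hidden = torch.randn(B, H)
visual_ctx, _ = ran(regions, hidden)
semantic_ctx = torch.randn(B, D)      # stand-in for a learned semantic embedding
fused = gate(hidden, visual_ctx, semantic_ctx)  # (B, D), fed to the word predictor
```

For a non-visual word such as "the" or "to", a trained gate would drive `beta` toward 0, so the prediction leans on the semantic context rather than the attended region features, matching the behavior the abstract describes.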
