Asian Conference on Computer Vision

Gated Hierarchical Attention for Image Captioning

Abstract

Attention modules connecting encoders and decoders have been widely applied to object recognition, image captioning, visual question answering, and neural machine translation, and significantly improve performance. In this paper, we propose a bottom-up gated hierarchical attention (GHA) mechanism for image captioning. Our proposed model employs a CNN as the decoder, which is able to learn different concepts at different layers; different concepts naturally correspond to different areas of an image. We therefore develop GHA, in which low-level concepts are merged into high-level concepts and, simultaneously, low-level attended features are passed to the top to make predictions. GHA significantly improves the performance of a model that applies only one level of attention; for example, the CIDEr score increases from 0.923 to 0.999, which is comparable to state-of-the-art models that employ attribute boosting and reinforcement learning (RL). We also conduct extensive experiments to analyze the CNN decoder and the proposed GHA, and we find that deeper decoders do not obtain better performance; moreover, as the convolutional decoder becomes deeper, the model is more likely to collapse during training.
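
The abstract describes the mechanism only at a high level. The following is a minimal PyTorch sketch of one plausible reading of bottom-up gated hierarchical attention, not the paper's implementation: every name (AttentionLevel, GatedHierarchicalAttention, the gate and merge layers) and the exact wiring of the gate are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLevel(nn.Module):
    """One attention level: soft attention over image regions given a query."""
    def __init__(self, feat_dim, query_dim, hidden_dim):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hidden_dim)
        self.proj_q = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, query):
        # regions: (B, R, feat_dim); query: (B, query_dim)
        e = self.score(torch.tanh(self.proj_v(regions) + self.proj_q(query).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)          # (B, R, 1) attention weights over regions
        return (alpha * regions).sum(dim=1)  # attended feature, (B, feat_dim)

class GatedHierarchicalAttention(nn.Module):
    """Stacks attention levels bottom-up. At each level a sigmoid gate scales
    the attended feature; the gated feature is both merged into the next
    level's query (low-level into high-level) and accumulated at the top
    for prediction. This wiring is an assumption, not the paper's spec."""
    def __init__(self, feat_dim, query_dim, hidden_dim, num_levels=3):
        super().__init__()
        self.levels = nn.ModuleList(
            AttentionLevel(feat_dim, query_dim, hidden_dim) for _ in range(num_levels))
        self.gates = nn.ModuleList(
            nn.Linear(query_dim + feat_dim, feat_dim) for _ in range(num_levels))
        self.merge = nn.Linear(feat_dim, query_dim)

    def forward(self, regions, query):
        # regions: (B, R, feat_dim); query: decoder state, (B, query_dim)
        top = 0.0
        q = query
        for attend, gate in zip(self.levels, self.gates):
            c = attend(regions, q)                              # attended feature at this level
            g = torch.sigmoid(gate(torch.cat([q, c], dim=-1)))  # gate values in (0, 1)
            gated = g * c                                       # gated low-level feature
            top = top + gated                                   # passed up for prediction
            q = q + self.merge(gated)                           # merged into next level's query
        return top                                              # (B, feat_dim)

# Example (shapes only, hypothetical sizes): 49 regions of 512-d features,
# a 256-d decoder state, batch of 2.
gha = GatedHierarchicalAttention(feat_dim=512, query_dim=256, hidden_dim=256)
context = gha(torch.randn(2, 49, 512), torch.randn(2, 256))  # -> (2, 512)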
