
Captioning with Language-Based Attention




The goal of image captioning via machine learning is to automatically produce a free-form description of an image while focusing on its significant objects. Inspired by recent work on attention in image captioning, we study different attention mechanisms within a deep learning setting. In contrast to previous research on attention models, which focuses on applying attention to the image modality, we introduce three language-based attention models. These models, which we developed iteratively from simpler RNN- and LSTM-based baseline models, consist of two sub-networks: a deep recurrent neural network for the language modality and a convolutional neural network for the image modality. The language-based attention models learn a joint representation of the language and image modalities, conditioned on the image and the previous words in the caption. At test time, novel captions are generated from the learned distribution. We provide a comparative quantitative and qualitative analysis of our three language-based attention models, which outperform the simple baseline models. We validate the effectiveness of our attention models with state-of-the-art performance on the Flickr8k dataset.
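
To make the idea of attending over the language modality concrete, the following is a minimal NumPy sketch of a single decoding step. It is not the architecture from this paper, which the abstract does not specify in detail. It assumes a generic dot-product attention in which a projected CNN image feature acts as the query over the embeddings of the previously generated words. All names (`language_attention_step`, `W_img`, `W_out`), the dimensions, and the random weights are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def language_attention_step(image_feat, prev_word_embs, W_img, W_out):
    """One decoding step (illustrative assumption, not the paper's model).

    image_feat:     (d_img,) CNN feature vector for the image
    prev_word_embs: (t, d)   embeddings of the t previous caption words
    W_img:          (d, d_img) projection of the image into word space
    W_out:          (V, d)   projection from the joint vector to the vocabulary
    """
    # Project the image feature into the word-embedding space.
    query = W_img @ image_feat
    # Scaled dot-product score of each previous word against the image query.
    scores = prev_word_embs @ query / np.sqrt(query.shape[0])
    # Attention weights over the previous words (non-negative, sum to 1).
    weights = softmax(scores)
    # Attention-weighted summary of the caption so far.
    context = weights @ prev_word_embs
    # Simple joint representation of the language and image modalities.
    joint = np.tanh(context + query)
    # Distribution over the next word.
    probs = softmax(W_out @ joint)
    return weights, probs

# Toy inputs with random values; a real system would use trained parameters.
rng = np.random.default_rng(0)
d_img, d, V, t = 8, 4, 10, 3
image_feat = rng.normal(size=d_img)
prev_word_embs = rng.normal(size=(t, d))
W_img = rng.normal(size=(d, d_img))
W_out = rng.normal(size=(V, d))

weights, probs = language_attention_step(image_feat, prev_word_embs, W_img, W_out)
```

Here `weights` holds one attention weight per previous word and `probs` is a distribution over a toy vocabulary of size `V`. In a trained captioning model the next word would be sampled or chosen from `probs`, appended to the caption, and the step repeated.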


