ACM Transactions on Multimedia Computing, Communications, and Applications

Constrained LSTM and Residual Attention for Image Captioning


Abstract

Visual structure and syntactic structure are essential in images and texts, respectively. Visual structure depicts both the entities in an image and their interactions, whereas syntactic structure in text reflects the part-of-speech constraints between adjacent words. Most existing methods either use a global visual representation to guide the language model or generate captions without considering the relationships among different entities or adjacent words; their language models therefore lack grounding in both visual and syntactic structure. To solve this problem, we propose a model that aligns the language model with a given visual structure and also constrains it with a specific part-of-speech template. In addition, most methods exploit the latent relationship between words in a sentence and pre-extracted visual regions in an image, yet ignore the effect of unextracted regions on the predicted words. We develop a residual attention mechanism that simultaneously attends to the pre-extracted visual objects and the unextracted regions of an image. Residual attention can thus capture the precise image regions corresponding to the predicted words, accounting for the effects of both visual objects and unextracted regions. The effectiveness of the entire framework and of each proposed module is verified on two classical datasets, MSCOCO and Flickr30k. Our framework is on par with or better than the state-of-the-art methods and achieves superior performance on the COCO captioning leaderboard.
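The abstract describes a residual attention mechanism that combines attention over detector-extracted objects with a term covering regions the detector missed. The paper does not give the exact formulation here, so the following is only a minimal numpy sketch of one plausible reading: attend over object features and over dense grid features (standing in for "unextracted regions"), then sum the two context vectors. The names `residual_attention`, the shared projection `W`, and the use of grid features as the residual source are all assumptions for illustration, not the authors' actual equations.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def residual_attention(obj_feats, grid_feats, h, W):
    """Sketch of a residual attention step (hypothetical formulation).

    obj_feats  : (k, d) pre-extracted object features
    grid_feats : (m, d) dense grid features covering the whole image,
                 used here as a stand-in for "unextracted regions"
    h          : (d,)   current LSTM hidden state
    W          : (d, d) learned projection (assumed parameter)
    """
    # Standard attention over the detected objects.
    alpha = softmax(obj_feats @ (W @ h))      # (k,) weights
    attended = alpha @ obj_feats              # (d,) object context

    # Residual term: attention over grid features, so image regions
    # missed by the detector can still influence the predicted word.
    beta = softmax(grid_feats @ (W @ h))
    residual = beta @ grid_feats

    # Context vector fed back to the language model.
    return attended + residual
```

The key design point the abstract argues for is the second term: without it, any region outside the detector's proposals can never contribute to word prediction.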
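The abstract also mentions constraining the language model with a part-of-speech template. One common way to realize such a constraint, sketched below purely as an assumption (the paper's actual mechanism may differ), is to mask the next-word distribution so that only vocabulary words whose POS tag is permitted by the template at the current step can be generated.

```python
import numpy as np

def pos_constrained_step(logits, vocab_pos, allowed_pos):
    """Mask next-word logits with a part-of-speech template (sketch).

    logits      : (V,) raw scores from the language model
    vocab_pos   : (V,) POS tag id assigned to each vocabulary word
    allowed_pos : set of POS tag ids the template permits at this step
    """
    # Words outside the template's allowed tags get -inf, i.e. zero
    # probability after the softmax.
    mask = np.isin(vocab_pos, list(allowed_pos))
    constrained = np.where(mask, logits, -np.inf)

    # Softmax over the permitted words only.
    e = np.exp(constrained - constrained[mask].max())
    return e / e.sum()
```

Under this reading, the template acts as a hard syntactic prior: the model's scores still rank the permitted words, but adjacent-word POS constraints can never be violated.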
