
VD-SAN: Visual-Densely Semantic Attention Network for Image Caption Generation



Abstract

Recently, attributes have demonstrated their effectiveness in guiding image captioning systems. However, most attribute-based image captioning methods treat attribute prediction as a separate task and rely on a standalone stage to obtain the attributes for a given image, typically a pre-trained network such as a Fully Convolutional Network (FCN). They thus ignore the correlation between the attribute prediction task and the image representation extraction task, and at the same time increase the complexity of the captioning system. In this paper, we couple the attribute prediction stage and the image representation extraction stage tightly, and propose a novel and efficient image captioning framework called the Visual-Densely Semantic Attention Network (VD-SAN). In particular, the whole captioning system consists of shared convolutional layers from a Dense Convolutional Network (DenseNet), which are further split into a semantic attribute prediction branch and an image feature extraction branch, two semantic attention models, and a long short-term memory network (LSTM) for caption generation. To evaluate the proposed architecture, we construct the Flickr30K-ATT and MS-COCO-ATT datasets from the popular image captioning datasets Flickr30K and MS COCO, respectively; each image in Flickr30K-ATT or MS-COCO-ATT is annotated with an attribute list in addition to its captions. Empirical results demonstrate that our captioning system achieves significant improvements over state-of-the-art approaches. (c) 2018 Elsevier B.V. All rights reserved.
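To make the pipeline described above easier to picture, the following is a minimal PyTorch sketch of the stated layout: shared DenseNet convolutional layers that split into an attribute prediction branch and an image feature branch, two attention modules (one over the predicted attributes, one over spatial image features), and an LSTM decoder. This is assembled from the abstract alone, not the authors' code; the split point, all layer dimensions, the top-10 attribute cutoff, and the way the attended vectors are fused into the LSTM input are assumptions made for illustration.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class VDSANSketch(nn.Module):
    """Illustrative layout only: shared DenseNet layers feeding an attribute
    branch and a feature branch, two attention models, and an LSTM caption
    decoder. All sizes and split points are assumptions, not the paper's."""

    def __init__(self, num_attrs=1000, vocab_size=10000,
                 embed_dim=512, hidden_dim=512, feat_dim=1024):
        super().__init__()
        blocks = list(models.densenet121(weights=None).features.children())
        self.shared = nn.Sequential(*blocks[:8])                       # shared conv layers
        self.attr_branch = nn.Sequential(*copy.deepcopy(blocks[8:]))   # semantic-attribute branch
        self.feat_branch = nn.Sequential(*copy.deepcopy(blocks[8:]))   # image-feature branch
        self.attr_head = nn.Linear(feat_dim, num_attrs)                # multi-label attribute scores
        self.attr_emb = nn.Embedding(num_attrs, embed_dim)             # one embedding per attribute
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.sem_score = nn.Linear(hidden_dim + embed_dim, 1)          # attention over attributes
        self.vis_score = nn.Linear(hidden_dim + feat_dim, 1)           # attention over spatial features
        self.lstm = nn.LSTMCell(2 * embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @staticmethod
    def attend(scorer, h, items):
        # items: (B, N, D); h: (B, H) -> attention-weighted sum over the N items
        n = items.size(1)
        scores = scorer(torch.cat([h.unsqueeze(1).expand(-1, n, -1), items], dim=-1))
        return (F.softmax(scores, dim=1) * items).sum(dim=1)

    def forward(self, images, captions):
        shared = self.shared(images)                       # one pass through the shared layers
        feat_map = self.feat_branch(shared)                # (B, feat_dim, h, w)
        attr_logits = self.attr_head(self.attr_branch(shared).mean(dim=(2, 3)))
        attr_vecs = self.attr_emb(attr_logits.topk(10, dim=1).indices)  # top-10 attribute embeddings
        feats = feat_map.flatten(2).transpose(1, 2)        # (B, h*w, feat_dim)
        h = feats.new_zeros(feats.size(0), self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            sem = self.attend(self.sem_score, h, attr_vecs)  # attended attribute vector
            vis = self.attend(self.vis_score, h, feats)      # attended visual vector
            h, c = self.lstm(torch.cat([self.word_emb(captions[:, t]), sem, vis], dim=1), (h, c))
            logits.append(self.out(h))
        # caption logits for word-level cross-entropy; attribute logits for a multi-label loss
        return torch.stack(logits, dim=1), attr_logits
```

Under this reading, the attribute branch would be trained with a multi-label loss against the -ATT annotations while the decoder takes the usual word-level cross-entropy, so both branches update the shared layers jointly; the hard top-k attribute selection here is a simplification and is not differentiable.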

Bibliographic Information

  • Source
    Neurocomputing | 2019, Issue 7 | pp. 48-55 | 8 pages
  • Author Affiliations

    Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan, Hubei, Peoples R China

  • Indexed In: Science Citation Index (SCI); Engineering Index (EI)
  • Format: PDF
  • Language: English
  • Keywords

    Image caption; Semantic attributes; Convolutional neural network; Long short-term memory networks


