
VD-SAN: Visual-Densely Semantic Attention Network for Image Caption Generation



Abstract

Recently, attributes have demonstrated their effectiveness in guiding image captioning systems. However, most attribute-based image captioning methods treat attribute prediction as a separate task and rely on a standalone stage to obtain the attributes for a given image, typically a pre-trained network such as a Fully Convolutional Network (FCN). They thus ignore the correlation between the attribute prediction task and the image representation extraction task, and at the same time increase the complexity of the captioning system. In this paper, we couple the attribute prediction stage and the image representation extraction stage tightly, and propose a novel and efficient image captioning framework called the Visual-Densely Semantic Attention Network (VD-SAN). In particular, the whole captioning system consists of shared convolutional layers from a Dense Convolutional Network (DenseNet), which are further split into a semantic attribute prediction branch and an image feature extraction branch, two semantic attention models, and a long short-term memory network (LSTM) for caption generation. To evaluate the proposed architecture, we construct the Flickr30K-ATT and MS-COCO-ATT datasets from the popular image captioning datasets Flickr30K and MS COCO, respectively; each image in Flickr30K-ATT or MS-COCO-ATT is annotated with an attribute list in addition to its captions. Empirical results demonstrate that our captioning system achieves significant improvements over state-of-the-art approaches. (c) 2018 Elsevier B.V. All rights reserved.
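To make the pipeline described above easier to picture, the following is a minimal PyTorch sketch of the stated layout: shared DenseNet convolutional layers that split into an attribute prediction branch and an image feature branch, two attention modules (one over the predicted attributes, one over spatial image features), and an LSTM decoder. This is assembled from the abstract alone, not the authors' code; the split point, all layer dimensions, the top-10 attribute cutoff, and the way the attended vectors are fused into the LSTM input are assumptions made for illustration.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class VDSANSketch(nn.Module):
    """Illustrative layout only: shared DenseNet layers feeding an attribute
    branch and a feature branch, two attention models, and an LSTM caption
    decoder. All sizes and split points are assumptions, not the paper's."""

    def __init__(self, num_attrs=1000, vocab_size=10000,
                 embed_dim=512, hidden_dim=512, feat_dim=1024):
        super().__init__()
        blocks = list(models.densenet121(weights=None).features.children())
        self.shared = nn.Sequential(*blocks[:8])                       # shared conv layers
        self.attr_branch = nn.Sequential(*copy.deepcopy(blocks[8:]))   # semantic-attribute branch
        self.feat_branch = nn.Sequential(*copy.deepcopy(blocks[8:]))   # image-feature branch
        self.attr_head = nn.Linear(feat_dim, num_attrs)                # multi-label attribute scores
        self.attr_emb = nn.Embedding(num_attrs, embed_dim)             # one embedding per attribute
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.sem_score = nn.Linear(hidden_dim + embed_dim, 1)          # attention over attributes
        self.vis_score = nn.Linear(hidden_dim + feat_dim, 1)           # attention over spatial features
        self.lstm = nn.LSTMCell(2 * embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @staticmethod
    def attend(scorer, h, items):
        # items: (B, N, D); h: (B, H) -> attention-weighted sum over the N items
        n = items.size(1)
        scores = scorer(torch.cat([h.unsqueeze(1).expand(-1, n, -1), items], dim=-1))
        return (F.softmax(scores, dim=1) * items).sum(dim=1)

    def forward(self, images, captions):
        shared = self.shared(images)                       # one pass through the shared layers
        feat_map = self.feat_branch(shared)                # (B, feat_dim, h, w)
        attr_logits = self.attr_head(self.attr_branch(shared).mean(dim=(2, 3)))
        attr_vecs = self.attr_emb(attr_logits.topk(10, dim=1).indices)  # top-10 attribute embeddings
        feats = feat_map.flatten(2).transpose(1, 2)        # (B, h*w, feat_dim)
        h = feats.new_zeros(feats.size(0), self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            sem = self.attend(self.sem_score, h, attr_vecs)  # attended attribute vector
            vis = self.attend(self.vis_score, h, feats)      # attended visual vector
            h, c = self.lstm(torch.cat([self.word_emb(captions[:, t]), sem, vis], dim=1), (h, c))
            logits.append(self.out(h))
        # caption logits for word-level cross-entropy; attribute logits for a multi-label loss
        return torch.stack(logits, dim=1), attr_logits
```

Under this reading, the attribute branch would be trained with a multi-label loss against the -ATT annotations while the decoder takes the usual word-level cross-entropy, so both branches update the shared layers jointly; the hard top-k attribute selection here is a simplification and is not differentiable.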

Bibliographic Information

  • Source
    Neurocomputing | 2019, Issue 7 | pp. 48-55 | 8 pages
  • Author Affiliations

    Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan, Hubei, Peoples R China

  • Indexed In: Science Citation Index (SCI); Engineering Index (EI)
  • Format: PDF
  • Language: English
  • Keywords

    Image caption; Semantic attributes; Convolutional neural network; Long short-term memory networks


