Journal: Multimedia Tools and Applications

Image captions: global-local and joint signals attention model (GL-JSAM)

Abstract

For automated visual captioning, existing neural encoder-decoder methods commonly use a simple sequence-to-sequence or an attention-based mechanism. Attention-based models attend to specific visual areas or objects, using a single heat map that indicates which portion of the image is most important rather than treating all objects within the image equally. These models are usually a mixture of Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architectures. CNNs generally extract global visual signals that provide only global information about the main objects, their attributes, and their relationships, but fail to provide local (regional) information within objects, such as lines, corners, curves, and edges. On one hand, missing some of the information and detail carried by local visual signals may lead to misprediction, misidentification of objects, or completely missing the main object(s). On the other hand, superfluous visual signal information, which may come from objects in the foreground or background, produces meaningless and irrelevant descriptions. To address these concerns, we created a new joint signals attention image captioning model for global and local signals that is adaptive by nature. First, the proposed model extracts global visual signals at the image level and local visual signals at the object level. The joint signal attention model (JSAM) plays a dual role in visual signal extraction and non-visual signal prediction. Initially, JSAM selects meaningful global and regional visual signals, discards irrelevant visual signals, and integrates the selected visual signals intelligently. Subsequently, in the language model, JSAM decides at each time step how to attend to visual or non-visual signals in order to generate accurate, descriptive, and elegant sentences. Lastly, we examine the efficiency and superiority of the proposed model over recent comparable image captioning models by conducting experiments on the MS-COCO dataset.
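
The joint attention described in the abstract (global image-level signals, object-level regional signals, and a non-visual signal the decoder can fall back on at each time step) can be sketched roughly as follows. This is only an interpretive sketch in PyTorch, not the authors' implementation: the dimension sizes, the sentinel-style non-visual signal, and the single softmax over all candidate signals are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSignalAttention(nn.Module):
    """One attention step over global (image-level), local (object-level)
    and non-visual (sentinel-style) signals, queried by the decoder state."""

    def __init__(self, d_img=2048, d_obj=2048, d_hid=512, d_att=512):
        super().__init__()
        self.proj_global = nn.Linear(d_img, d_att)    # image-level (global) signal
        self.proj_local = nn.Linear(d_obj, d_att)     # object-level (local) signals
        self.proj_sentinel = nn.Linear(d_hid, d_att)  # non-visual (language) signal
        self.proj_hidden = nn.Linear(d_hid, d_att)    # decoder hidden state as query
        self.score = nn.Linear(d_att, 1)

    def forward(self, global_feat, local_feats, hidden, sentinel):
        # global_feat: (B, d_img); local_feats: (B, N, d_obj)
        # hidden, sentinel: (B, d_hid)
        g = self.proj_global(global_feat).unsqueeze(1)   # (B, 1, d_att)
        loc = self.proj_local(local_feats)               # (B, N, d_att)
        s = self.proj_sentinel(sentinel).unsqueeze(1)    # (B, 1, d_att)
        q = self.proj_hidden(hidden).unsqueeze(1)        # (B, 1, d_att)

        # Candidate signals: non-visual sentinel + global image + local regions.
        cand = torch.cat([s, g, loc], dim=1)                    # (B, 2 + N, d_att)
        scores = self.score(torch.tanh(cand + q)).squeeze(-1)   # (B, 2 + N)
        alpha = F.softmax(scores, dim=-1)                       # joint attention weights

        context = (alpha.unsqueeze(-1) * cand).sum(dim=1)       # (B, d_att)
        beta = alpha[:, 0]  # weight placed on the non-visual signal at this step
        return context, alpha, beta


# Example: 36 detected object regions, batch of 2 (shapes are illustrative).
att = JointSignalAttention()
ctx, alpha, beta = att(torch.randn(2, 2048), torch.randn(2, 36, 2048),
                       torch.randn(2, 512), torch.randn(2, 512))
```

In this reading, the language model would feed `context` into the word predictor at each time step, and `beta` indicates how strongly the step relied on the non-visual signal rather than the global or regional visual signals.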