IEEE Transactions on Pattern Analysis and Machine Intelligence

Hierarchical LSTMs with Adaptive Attention for Visual Captioning



Abstract

Recent progress has been made in using attention-based encoder-decoder frameworks for image and video captioning. Most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). However, these non-visual words can be easily predicted using a natural language model without considering visual signals or attention, and imposing the attention mechanism on them can mislead the decoder and decrease the overall performance of visual captioning. Furthermore, a hierarchy of LSTMs enables a more complex representation of visual data, capturing information at different scales. Considering these issues, we propose a hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning. Specifically, the proposed framework utilizes spatial or temporal attention to select the specific regions or frames for predicting the related words, while the adaptive attention decides whether to depend on the visual information or the language context information. In addition, hierarchical LSTMs are designed to simultaneously consider both low-level visual information and high-level language context information to support caption generation. We design the hLSTMat model as a general framework, first instantiating it for the task of video captioning. Then, we further refine our hLSTMat model and apply it to the image captioning task. To demonstrate the effectiveness of the proposed framework, we test our method on both video and image captioning tasks. Experimental results show that our approach achieves state-of-the-art performance on most of the evaluation metrics for both tasks. The effects of its important components are also examined in an ablation study.
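
To make the mechanism concrete, here is a minimal PyTorch sketch of one decoding step in the spirit of the abstract: a low-level LSTM processes the word input, an adaptive attention module mixes attended visual features with a language "sentinel" vector, and a high-level LSTM consumes the mixed context. The class names, dimensions, and the sentinel formulation are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    """Attends over N visual features plus one language 'sentinel'; the
    attention weight on the sentinel acts as the gate deciding whether the
    next word is driven by vision or by language context."""
    def __init__(self, dim, att_dim=128):
        super().__init__()
        self.feat_proj = nn.Linear(dim, att_dim)
        self.hid_proj = nn.Linear(dim, att_dim)
        self.sent_proj = nn.Linear(dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, hidden, sentinel):
        # feats: (B, N, dim) region/frame features; hidden, sentinel: (B, dim)
        h = self.hid_proj(hidden).unsqueeze(1)                      # (B, 1, A)
        e_vis = self.score(torch.tanh(self.feat_proj(feats) + h))  # (B, N, 1)
        e_sen = self.score(torch.tanh(self.sent_proj(sentinel).unsqueeze(1) + h))  # (B, 1, 1)
        alpha = F.softmax(torch.cat([e_vis, e_sen], dim=1).squeeze(-1), dim=1)     # (B, N+1)
        beta = alpha[:, -1:]                                # gate: weight on language context
        ctx = (alpha[:, :-1].unsqueeze(-1) * feats).sum(1)  # attended visual context
        return (1 - beta) * ctx + beta * sentinel, beta

class HierarchicalDecoderStep(nn.Module):
    """One decoding step: a low-level LSTM consumes the word embedding and
    a high-level LSTM consumes the adaptively attended context."""
    def __init__(self, dim):
        super().__init__()
        self.low = nn.LSTMCell(dim, dim)     # low-level: word/visual signal
        self.high = nn.LSTMCell(dim, dim)    # high-level: language context
        self.gate = nn.Linear(2 * dim, dim)  # derives the sentinel from the low LSTM
        self.att = AdaptiveAttention(dim)

    def forward(self, word_emb, feats, low_state, high_state):
        h_low, c_low = self.low(word_emb, low_state)
        # Sentinel: a gated view of the low-level memory (language fallback).
        s = torch.sigmoid(self.gate(torch.cat([word_emb, h_low], -1))) * torch.tanh(c_low)
        ctx, beta = self.att(feats, h_low, s)
        h_high, c_high = self.high(ctx, high_state)
        return h_high, beta, (h_low, c_low), (h_high, c_high)

if __name__ == "__main__":
    B, N, D = 2, 8, 512
    step = HierarchicalDecoderStep(D)
    zeros = lambda: (torch.zeros(B, D), torch.zeros(B, D))
    h, beta, _, _ = step(torch.randn(B, D), torch.randn(B, N, D), zeros(), zeros())
    print(h.shape, beta.squeeze(-1))  # hidden state for word prediction; per-example gate

At each step, a beta near 1 means the model leaned on the language context (useful for words like "the"), while a beta near 0 means it relied on the attended visual features.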

