Pattern Recognition: The Journal of the Pattern Recognition Society

Dual self-attention with co-attention networks for visual question answering

Abstract

Visual Question Answering (VQA), an important task in understanding vision and language, has attracted wide interest. In previous VQA methods, Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are generally used to extract visual and textual features respectively, and the correlation between these two kinds of features is then explored to infer the answer. However, CNN mainly focuses on extracting local spatial information, while RNN concentrates on exploiting sequential structure and long-range dependencies. Neither can easily integrate local features with their global dependencies to learn more effective representations of the image and question. To address this problem, we propose a novel model for VQA, Dual Self-Attention with Co-Attention networks (DSACA). It models the internal dependencies of the spatial and sequential structure respectively using a newly proposed self-attention mechanism. Specifically, DSACA contains three sub-modules. The visual self-attention module selectively aggregates the visual features at each region by a weighted sum of the features at all positions. The textual self-attention module automatically emphasizes interdependent word features by integrating associated features among the sentence words. Finally, the visual-textual co-attention module explores the close correlation between the visual and textual features learned by the self-attention modules. The three modules are integrated into an end-to-end framework to infer the answer. Extensive experiments on three widely used VQA datasets confirm the favorable performance of DSACA compared with state-of-the-art methods.
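The core mechanism the abstract describes, updating each position's feature by a weighted sum over the features at all positions and then cross-attending the two modalities, can be illustrated with a short sketch. The following is a minimal PyTorch illustration, not the paper's implementation: the module names, feature dimensions, scaled dot-product form, and the single co-attention direction shown are all assumptions made for the example.

```python
# Minimal sketch of self-attention aggregation plus one-directional
# co-attention, assuming scaled dot-product attention. Details such as
# dimensions and projections are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Aggregate the feature at each position by a weighted sum of the
    features at all positions, with a residual keeping the local feature."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (batch, positions, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v + x                     # global context + local feature

class CoAttention(nn.Module):
    """Let question words attend over image regions (one direction shown)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, text, image):             # text: (b, T, d); image: (b, R, d)
        attn = F.softmax(self.proj(text) @ image.transpose(-2, -1), dim=-1)
        return attn @ image                     # region features per word

# Usage on dummy CNN region features and RNN word features (shapes assumed)
regions = torch.randn(2, 36, 512)               # e.g. 36 detected image regions
words = torch.randn(2, 14, 512)                 # e.g. 14 question tokens
v = SelfAttention(512)(regions)                 # visual self-attention module
t = SelfAttention(512)(words)                   # textual self-attention module
fused = CoAttention(512)(t, v)                  # visual-textual co-attention
```

Under these assumptions, the two self-attention modules give each region and each word a globally contextualized representation before the co-attention step correlates the modalities, which is the division of labor the abstract attributes to DSACA's three sub-modules.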