Pattern Recognition: The Journal of the Pattern Recognition Society

Dual self-attention with co-attention networks for visual question answering

Abstract

Visual Question Answering (VQA), an important task in understanding vision and language, has attracted wide interest. In previous VQA methods, Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are generally used to extract visual and textual features respectively, and the correlation between these two kinds of features is then explored to infer the answer. However, CNN mainly focuses on extracting local spatial information, while RNN concentrates on exploiting sequential structure and long-range dependencies. Neither can easily integrate local features with their global dependencies to learn more effective representations of the image and question. To address this problem, we propose a novel model for VQA, Dual Self-Attention with Co-Attention networks (DSACA). It models the internal dependencies of the spatial and sequential structure respectively using a newly proposed self-attention mechanism. Specifically, DSACA contains three sub-modules. The visual self-attention module selectively aggregates the visual features at each region by a weighted sum of the features at all positions. The textual self-attention module automatically emphasizes interdependent word features by integrating associated features among the sentence words. Finally, the visual-textual co-attention module explores the close correlation between the visual and textual features learned by the self-attention modules. The three modules are integrated into an end-to-end framework to infer the answer. Extensive experiments on three widely used VQA datasets confirm the favorable performance of DSACA compared with state-of-the-art methods.
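The core mechanism the abstract describes, updating each position's feature by a weighted sum over the features at all positions and then cross-attending the two modalities, can be illustrated with a short sketch. The following is a minimal PyTorch illustration, not the paper's implementation: the module names, feature dimensions, scaled dot-product form, and the single co-attention direction shown are all assumptions made for the example.

```python
# Minimal sketch of self-attention aggregation plus one-directional
# co-attention, assuming scaled dot-product attention. Details such as
# dimensions and projections are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Aggregate the feature at each position by a weighted sum of the
    features at all positions, with a residual keeping the local feature."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (batch, positions, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v + x                     # global context + local feature

class CoAttention(nn.Module):
    """Let question words attend over image regions (one direction shown)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, text, image):             # text: (b, T, d); image: (b, R, d)
        attn = F.softmax(self.proj(text) @ image.transpose(-2, -1), dim=-1)
        return attn @ image                     # region features per word

# Usage on dummy CNN region features and RNN word features (shapes assumed)
regions = torch.randn(2, 36, 512)               # e.g. 36 detected image regions
words = torch.randn(2, 14, 512)                 # e.g. 14 question tokens
v = SelfAttention(512)(regions)                 # visual self-attention module
t = SelfAttention(512)(words)                   # textual self-attention module
fused = CoAttention(512)(t, v)                  # visual-textual co-attention
```

Under these assumptions, the two self-attention modules give each region and each word a globally contextualized representation before the co-attention step correlates the modalities, which is the division of labor the abstract attributes to DSACA's three sub-modules.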