Deep Attention Neural Tensor Network for Visual Question Answering

Abstract

Visual question answering (VQA), in which a machine answers a natural language question about a reference image, has drawn great attention in cross-modal learning. Significant progress has been made by learning rich embedding features from images and questions with bilinear models, but these approaches neglect the key role of answers. In this paper, we propose a novel deep attention neural tensor network (DA-NTN) for visual question answering, which can discover the joint correlations over images, questions, and answers with tensor-based representations. First, we model one of the pairwise interactions (e.g., image and question) by bilinear features, which are further encoded with the third dimension (e.g., answer) into a triplet by a bilinear tensor product. Second, we decompose the correlation of different triplets by different answer and question types, and further propose a slice-wise attention module on the tensor to select the most discriminative reasoning process for inference. Third, we optimize the proposed DA-NTN by learning a label regression with KL-divergence losses. Such a design enables scalable training and fast convergence over a large answer set. We integrate the proposed DA-NTN structure into state-of-the-art VQA models (e.g., MLB and MUTAN). Extensive experiments demonstrate superior accuracy over the original MLB and MUTAN models, with relative increases of 1.98% and 1.70% on the VQA-2.0 dataset, respectively.
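
The three steps in the abstract (bilinear tensor product, slice-wise attention, KL-divergence label regression) can be illustrated with a minimal pure-Python sketch. The abstract does not give the exact equations, so everything here is an assumption made for illustration: the function names, the choice to condition the slice attention only on the joint image-question feature, the toy dimensions, and the made-up soft labels. It is not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear(v, W, a):
    """Bilinear form v^T W a for one tensor slice W (len(v) x len(a))."""
    return sum(v[i] * W[i][j] * a[j]
               for i in range(len(v)) for j in range(len(a)))


def triplet_score(v_qi, a, tensor, attn_w):
    """Attention-weighted sum of per-slice bilinear scores.

    v_qi:   joint image-question feature (e.g., from a bilinear model)
    a:      answer embedding
    tensor: list of k slices, each a len(v_qi) x len(a) matrix
    attn_w: k x len(v_qi) attention weights (hypothetical parameterization)
    """
    slice_scores = [bilinear(v_qi, W, a) for W in tensor]
    # Assumption: attention over slices is conditioned on v_qi only.
    logits = [sum(w_i * x for w_i, x in zip(w, v_qi)) for w in attn_w]
    attn = softmax(logits)
    return sum(s * p for s, p in zip(slice_scores, attn))


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)


# Toy example: 2-d features, k = 2 slices, 3 candidate answers.
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
tensor = [identity, doubled]
attn_w = [[0.0, 0.0], [0.0, 0.0]]   # zero weights give uniform attention
v_qi = [1.0, 2.0]
answers = [[3.0, 4.0], [0.0, 1.0], [1.0, 0.0]]

scores = [triplet_score(v_qi, a, tensor, attn_w) for a in answers]
predicted = softmax(scores)          # distribution over candidate answers
target = [0.9, 0.1, 0.0]             # soft answer labels (made up)
loss = kl_divergence(target, predicted)
print(scores, loss)
```

In this sketch each tensor slice yields one bilinear score per (image-question, answer) pair, the attention weights decide how much each slice contributes, and the KL term compares the resulting answer distribution with soft labels. In a trained model the slices and attention weights would be learned parameters rather than fixed toy values.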

Bibliographic record

Source
European Conference on Computer Vision, 2018, pp. 21-37 (17 pages)
Authors
Yalong Bai; Jianlong Fu; Tiejun Zhao; Tao Mei
Original format: PDF
Keywords
Visual question answering; Neural tensor network; Open-ended VQA
