Revisiting Visual Question Answering Baselines

Abstract

Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to perform "reasoning". Furthermore, for the task of multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict an answer. This paper questions the value of these common practices and develops a simple alternative model based on binary classification. Instead of treating answers as competing choices, our model receives the answer as input and predicts whether or not an image-question-answer triplet is correct. We evaluate our model on the Visual7W Telling and the VQA Real Multiple Choice tasks, and find that even simple versions of our model perform competitively. Our best model achieves state-of-the-art performance of 65.8% accuracy on the Visual7W Telling task and compares surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task. Additionally, we explore variants of the model and study the transferability of the model between both datasets. We also present an error analysis of our best model, the results of which suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers.
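
The binary formulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors, the single linear layer, and the names `score_triplet` and `pick_answer` are assumptions made for illustration only. The sketch scores each image-question-answer triplet independently with a logistic function and, for a multiple-choice question, selects the candidate answer with the highest score.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score_triplet(weights, bias, image_feat, question_feat, answer_feat):
    """Return a score in (0, 1) that the triplet is correct.

    Hypothetical setup: the three feature vectors are plain lists of floats
    that are concatenated and passed through one linear layer.
    """
    x = image_feat + question_feat + answer_feat  # list concatenation
    if len(x) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return sigmoid(z)


def pick_answer(weights, bias, image_feat, question_feat, candidate_feats):
    """Score every candidate answer separately and return (best_index, scores)."""
    scores = [
        score_triplet(weights, bias, image_feat, question_feat, answer_feat)
        for answer_feat in candidate_feats
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy usage with hand-picked (not learned) weights: only the answer
# features influence the score in this example.
weights = [0.0, 0.0, 0.0, 0.0, 1.0, -1.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
best, scores = pick_answer(weights, 0.0, [0.2, 0.1], [0.5, 0.3], candidates)
print(best, scores)  # best == 1
```

Because each candidate is scored on its own, the answers are model inputs rather than output classes, which is the contrast with a multi-class classifier that the abstract draws.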

Bibliographic record

Source: European Conference on Computer Vision, 2016, pp. 727-739 (13 pages)
Authors: Allan Jabri; Armand Joulin; Laurens van der Maaten
Original format: PDF
Keywords: Visual question answering; Dataset bias

Similar literature

1. Multiple answers to a question: a new approach for visual question answering [J]. Hosseinabad Sayedshayan Hashemi, Safayani Mehran, Mirzaei Abdolreza. The Visual Computer, 2021, Issue 1
2. Question-aware prediction with candidate answer recommendation for visual question answering [J]. B. Kim, J. Kim. Electronics Letters, 2017, Issue 18
3. Multi-Tier Attention Network using Term-weighted Question Features for Visual Question Answering [J]. Manmadhan Sruthy, Kovoor Binsu C. Image and Vision Computing, 2021, Issue Nova
4. Revisiting Visual Question Answering Baselines [C]. Allan Jabri, Armand Joulin, Laurens van der Maaten. European Conference on Computer Vision, 2016
5. Attention Correction Mechanisms in Visual Contexts in Visual Question Answering [D]. Sharan, Komal. 2018
6. Removal of failed dental implants revisited: Questions and answers [O]. Alex Solderer, Adrian Al‐Jazrawi, Philipp Sahrmann. 2019
7. Revisiting Visual Question Answering Baselines [O]. Jabri, Allan; Joulin, Armand; van der Maaten, Laurens. 2016
