International Conference on Computer Vision

SegEQA: Video Segmentation Based Visual Attention for Embodied Question Answering



Abstract

Embodied Question Answering (EQA) is a newly defined research area in which an agent is required to answer a user's questions by exploring a real-world environment. It has attracted increasing research interest due to its broad applications in autonomous driving systems, in-home robots, and personal assistants. Most existing methods perform poorly in terms of answering and navigation accuracy due to the absence of local details and their vulnerability to the ambiguity caused by complicated vision conditions. To tackle these problems, we propose a segmentation-based visual attention mechanism for Embodied Question Answering. First, we extract local semantic features by introducing a novel high-speed video segmentation framework. Then, guided by the extracted semantic features, a bottom-up visual attention mechanism is proposed for the Visual Question Answering (VQA) sub-task. Further, a feature fusion strategy is proposed to guide the training of the navigator without much additional computational cost. Ablation experiments show that our method boosts the performance of the VQA module by 4.2% (68.99% vs. 64.73%) and leads to a 3.6% (48.59% vs. 44.98%) overall improvement in EQA accuracy.
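The abstract describes the attention mechanism only at a high level. The sketch below illustrates one plausible reading of the segmentation-guided, bottom-up attention step: per-region features pooled inside segmentation masks are scored against a question embedding and combined into a single attended feature for the VQA sub-task. The module name, dimensions, and the additive scoring function are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of segmentation-guided bottom-up attention (assumed design,
# not the paper's code). Region features are assumed to be pooled from
# segmentation masks upstream.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegGuidedAttention(nn.Module):
    """Attend over per-region semantic features using a question embedding."""
    def __init__(self, region_dim=512, question_dim=256, hidden_dim=256):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, region_feats, question_emb):
        # region_feats: (B, R, region_dim), one vector per segmented region
        # question_emb: (B, question_dim)
        r = self.region_proj(region_feats)                    # (B, R, H)
        q = self.question_proj(question_emb).unsqueeze(1)     # (B, 1, H)
        scores = self.score(torch.tanh(r + q)).squeeze(-1)    # (B, R)
        weights = F.softmax(scores, dim=-1)                   # attention over regions
        attended = (weights.unsqueeze(-1) * region_feats).sum(dim=1)  # (B, region_dim)
        return attended, weights

# Toy usage: 8 segmented regions, batch of 2 questions.
regions = torch.randn(2, 8, 512)
question = torch.randn(2, 256)
attended, weights = SegGuidedAttention()(regions, question)
print(attended.shape, weights.shape)  # torch.Size([2, 512]) torch.Size([2, 8])
```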
