IEEE/CVF Conference on Computer Vision and Pattern Recognition

Focal Visual-Text Attention for Visual Question Answering



Abstract

Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering. However, to tackle real-life question answering over multimedia collections such as personal photos, we have to reason over whole collections containing sequences of photos or videos. When answering questions from a large collection, a natural problem is to identify the snippets that support the answer. In this paper, we describe a novel neural network called the Focal Visual-Text Attention network (FVTA) for collective reasoning in visual question answering, where both visual and text sequences, such as images and their text metadata, are present. FVTA introduces an end-to-end approach that uses a hierarchical process to dynamically determine which medium and which time steps to focus on in the sequential data to answer the question. FVTA not only answers the questions well but also provides the justifications on which its answers are based. FVTA achieves state-of-the-art performance on the MemexQA dataset and competitive results on the MovieQA dataset.
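The hierarchical attention the abstract describes (first attend over time within each sequence, then across sequences/media) can be sketched as follows. This is a minimal illustration with hypothetical names (`focal_attention`, dot-product scoring), not the paper's actual architecture, which the abstract does not specify in detail:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def focal_attention(question, sequences):
    """Two-level (focal) attention sketch: attend over time steps
    within each sequence, then over the sequences themselves."""
    contexts, seq_scores = [], []
    for seq in sequences:                  # seq: (T, d) features of one medium
        scores = seq @ question            # (T,) relevance of each time step
        alpha = softmax(scores)            # temporal attention weights
        ctx = alpha @ seq                  # (d,) attended summary of the sequence
        contexts.append(ctx)
        seq_scores.append(ctx @ question)  # relevance of the whole sequence
    beta = softmax(np.array(seq_scores))   # cross-sequence (media) weights
    return beta @ np.stack(contexts)       # (d,) fused context for answering

# toy example: one photo sequence and one text-metadata sequence
q = np.array([1.0, 0.0, 0.0])
photos = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]])
texts = np.array([[0.2, 0.2, 0.6], [0.7, 0.2, 0.1]])
out = focal_attention(q, [photos, texts])
```

The intermediate weights `alpha` and `beta` also serve as the kind of justification the abstract mentions: they indicate which time step and which medium the answer is grounded in.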
