IEEE/CVF Conference on Computer Vision and Pattern Recognition

Focal Visual-Text Attention for Visual Question Answering



Abstract

Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering. However, to tackle real-life question answering over multimedia collections such as personal photos, we have to look at whole collections containing sequences of photos or videos. When answering questions over a large collection, a natural problem is to identify the snippets that support the answer. In this paper, we describe a novel neural network called the Focal Visual-Text Attention network (FVTA) for collective reasoning in visual question answering, where both visual and text sequence information, such as images and text metadata, are presented. FVTA introduces an end-to-end approach that uses a hierarchical process to dynamically determine which media and which time steps in the sequential data to focus on when answering the question. FVTA not only answers the questions well but also provides justifications, i.e., the evidence upon which its answers are based. FVTA achieves state-of-the-art performance on the MemexQA dataset and competitive results on the MovieQA dataset.

