International Conference on Pattern Recognition

Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering

Abstract

Visual Question Answering (VQA) is challenging due to complex cross-modal relations, and it has received extensive attention from the research community. From the human perspective, to answer a visual question one needs to read the question and then refer to the image to generate an answer. This answer is then checked against the question and image again for final confirmation. In this paper, we mimic this process and propose a fully attention-based VQA architecture. Moreover, an answer-checking module is proposed that performs unified attention on the joint answer, question, and image representation to update the answer. This mimics the human answer-checking process of considering the answer in context. With answer-checking modules and transferred BERT layers, our model achieves a state-of-the-art accuracy of 71.57% using fewer parameters on the VQA-v2.0 test-standard split.
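The abstract only sketches the architecture, but the answer-checking step it describes can be illustrated as a single self-attention pass over the concatenated answer, question, and image representations, after which the updated answer position is read out. The following PyTorch snippet is a minimal hypothetical sketch of that idea; the `AnswerChecker` name, the dimensions, and the single-block design are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the "answer-checking" idea: one unified
# self-attention pass over the joint [answer; question; image] sequence,
# whose output updates the answer representation. Names and sizes are
# illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn

class AnswerChecker(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # One transformer-style self-attention block over the joint sequence.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, answer, question, image):
        # answer:   (B, 1, D)  candidate-answer embedding
        # question: (B, Lq, D) question token features (e.g. from BERT layers)
        # image:    (B, Lv, D) image region features
        joint = torch.cat([answer, question, image], dim=1)  # (B, 1+Lq+Lv, D)
        attended, _ = self.attn(joint, joint, joint)
        joint = self.norm(joint + attended)  # residual connection + layer norm
        # The first position is the answer token; its updated representation
        # is the "checked" answer used for the final prediction.
        return joint[:, 0]  # (B, D)

if __name__ == "__main__":
    checker = AnswerChecker()
    B, Lq, Lv, D = 2, 14, 36, 768
    updated = checker(torch.randn(B, 1, D),
                      torch.randn(B, Lq, D),
                      torch.randn(B, Lv, D))
    print(updated.shape)  # torch.Size([2, 768])
```

Attending over the whole joint sequence lets the answer token be re-weighted against both the question and the image at once, which is one plausible reading of the "unified attention" the abstract describes.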
