Knowledge-Based Systems

Multimodal deep fusion for image question answering

Abstract

Multimodal fusion plays a key role in Image Question Answering (IQA). However, most current algorithms are insufficient for fusing the multiple relations implied across modalities, which are vital for predicting correct answers. In this paper, we design an effective Multimodal Deep Fusion Network (MDFNet) to achieve fine-grained multimodal fusion. Specifically, we propose a Graph Reasoning and Fusion Layer (GRFL) to reason about the complex spatial and semantic relations between visual objects and to fuse these two kinds of relations adaptively. This fusion strategy allows different relations to make different contributions, guided by the reasoning step. A Multimodal Deep Fusion Network is then built by stacking several GRFLs to achieve sufficient multimodal fusion. Quantitative and qualitative experiments conducted on popular benchmarks, including VQA v2 and GQA, demonstrate the effectiveness of MDFNet. Our best single model achieves 71.19% overall accuracy on the VQA v2 dataset and 57.05% accuracy on the GQA dataset. (C) 2020 Elsevier B.V. All rights reserved.
