Knowledge-Based Systems

Multimodal deep fusion for image question answering

Abstract

Multimodal fusion plays a key role in Image Question Answering (IQA). However, most current algorithms are insufficient for fusing the multiple relations implied across modalities, which are vital for predicting correct answers. In this paper, we design an effective Multimodal Deep Fusion Network (MDFNet) to achieve fine-grained multimodal fusion. Specifically, we propose a Graph Reasoning and Fusion Layer (GRFL) to reason about the complex spatial and semantic relations between visual objects and to fuse these two kinds of relations adaptively. This fusion strategy allows different relations to make different contributions, guided by the reasoning step. A Multimodal Deep Fusion Network is then built by stacking several GRFLs to achieve sufficient multimodal fusion. Quantitative and qualitative experiments conducted on popular benchmarks, including VQA v2 and GQA, demonstrate the effectiveness of MDFNet. Our best single model achieves 71.19% overall accuracy on the VQA v2 dataset and 57.05% accuracy on the GQA dataset. (C) 2020 Elsevier B.V. All rights reserved.
