Journal: Information Fusion

Information fusion in visual question answering: A Survey




Visual question answering (VQA) automatically answers natural language questions according to the content of an image or video. The task is challenging because it requires understanding the semantic information in both the textual and visual channels, as well as their interplay. A typical solver is composed of three components: feature extraction from each individual modality, feature fusion between the visual and textual channels, and answer prediction based on the learnt joint representation. Among them, information fusion plays a key role in overall accuracy, and various types of approaches have been proposed, such as simple vector operators, deep neural networks, bilinear pooling, attention mechanisms, and memory networks. The primary objective of this survey is to provide a clear organization and comprehensive review of the fusion techniques proposed to date in the domain of visual question answering. We propose an abstract fusion framework that fits the majority of existing VQA models, making it convenient for readers to quickly understand their key contributions. Finally, we summarize the effective fusion strategies that have been widely adopted, so as to benefit readers in their model design.
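To make the fusion step more concrete, the following is a minimal sketch of three of the fusion families named in the abstract: simple vector operators, bilinear pooling, and attention. It is not taken from the survey. All function names are hypothetical, the vectors stand in for already-extracted visual and question features, and the bilinear and attention variants are deliberately unparameterized (an assumption made for brevity; practical models typically add learned projections or compact approximations).

```python
import math
from typing import List

Vector = List[float]


def concat_fusion(v: Vector, q: Vector) -> Vector:
    """Simple vector operator: concatenate visual and question features."""
    return v + q


def sum_fusion(v: Vector, q: Vector) -> Vector:
    """Simple vector operator: element-wise sum (same dimension required)."""
    if len(v) != len(q):
        raise ValueError("element-wise fusion needs equal-length vectors")
    return [a + b for a, b in zip(v, q)]


def product_fusion(v: Vector, q: Vector) -> Vector:
    """Simple vector operator: element-wise (Hadamard) product."""
    if len(v) != len(q):
        raise ValueError("element-wise fusion needs equal-length vectors")
    return [a * b for a, b in zip(v, q)]


def bilinear_fusion(v: Vector, q: Vector) -> Vector:
    """Unparameterized bilinear pooling: flattened outer product of v and q.

    Every visual dimension interacts with every question dimension, so the
    output has len(v) * len(q) entries.
    """
    return [a * b for a in v for b in q]


def attention_fusion(regions: List[Vector], q: Vector) -> Vector:
    """Question-guided attention over image regions.

    Scores each region by its dot product with the question vector, turns the
    scores into weights with a softmax, and returns the weighted sum of regions.
    """
    scores = [sum(a * b for a, b in zip(r, q)) for r in regions]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(regions[0])
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]


if __name__ == "__main__":
    v, q = [1.0, 2.0], [3.0, 4.0]
    print(concat_fusion(v, q))
    print(product_fusion(v, q))
    print(bilinear_fusion(v, q))
    print(attention_fusion([[1.0, 0.0], [0.0, 1.0]], q))
```

In a full solver, the fused vector produced by any of these functions would be passed to the answer-prediction component; which operator works best is an empirical question that this sketch does not address.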


