
Multi-modal Feature Fusion Based on Variational Autoencoder for Visual Question Answering

Chinese Conference on Pattern Recognition and Computer Vision


Abstract

Visual Question Answering (VQA) tasks require providing a correct answer to a question posed about a given image, a requirement that has drawn wide attention since the task was first proposed. VQA consists of four steps: image feature extraction, question text feature extraction, multi-modal feature fusion, and answer reasoning. During multi-modal feature fusion, existing models rely on outer-product calculations, which lead to excessive model parameters, high training overhead, and slow convergence. To avoid these problems, we apply the Variational Autoencoder (VAE) method to calculate the probability distributions of the latent variables of the image and the question text. Furthermore, we design a question feature hierarchy method based on the traditional attention mechanism model and the VAE. The objective is to investigate deep correlation features between questions and images to improve the accuracy of VQA tasks.
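The abstract does not spell out the architecture, so the following is a minimal PyTorch sketch of the general idea it describes: each modality is encoded into a Gaussian latent distribution, latents are sampled with the reparameterization trick, and the samples are fused without an outer product. The class name VAEFusion, the feature dimensions, and the elementwise fusion choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VAEFusion(nn.Module):
    """Sketch: encode image and question features into Gaussian latent
    distributions, sample via the reparameterization trick, and fuse the
    latent samples elementwise instead of taking an outer product."""

    def __init__(self, img_dim=2048, q_dim=1024, z_dim=512, n_answers=3000):
        super().__init__()
        # Each encoder outputs the mean and log-variance of a latent Gaussian.
        self.img_enc = nn.Linear(img_dim, 2 * z_dim)
        self.q_enc = nn.Linear(q_dim, 2 * z_dim)
        self.classifier = nn.Linear(z_dim, n_answers)

    @staticmethod
    def reparameterize(mu, logvar):
        # z = mu + sigma * eps, with eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, img_feat, q_feat):
        mu_v, logvar_v = self.img_enc(img_feat).chunk(2, dim=-1)
        mu_q, logvar_q = self.q_enc(q_feat).chunk(2, dim=-1)
        z_v = self.reparameterize(mu_v, logvar_v)
        z_q = self.reparameterize(mu_q, logvar_q)
        # Elementwise fusion keeps the fused dimension at z_dim, whereas an
        # outer product would blow it up to z_dim * z_dim parameters downstream.
        fused = z_v * z_q
        logits = self.classifier(fused)
        # KL terms regularize each latent toward the standard normal prior.
        kl = lambda mu, lv: -0.5 * torch.sum(1 + lv - mu.pow(2) - lv.exp(), dim=-1)
        return logits, (kl(mu_v, logvar_v) + kl(mu_q, logvar_q)).mean()
```

A training loss would combine cross-entropy on the logits with the returned KL term, following the standard VAE objective; this is what lets the latent distributions, rather than a high-dimensional outer product, carry the cross-modal interaction.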
