International Conference on Knowledge and Systems Engineering

Integrating Transformer into Global and Residual Image Feature Extractor in Visual Question Answering for Blind People

Abstract

Visual Question Answering (VQA), a recent task at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), extracts answers from features of both questions and images. Current VQA approaches rely on combining convolutional and recurrent networks, which leads to a large number of parameters in the learning phase. Building on the success of pre-trained models, we integrate BERT [1] for text embedding and two models, ResNets [2] and VGG [3], for image embedding. In addition, we propose to take advantage of fine-tuning techniques and a stacked attention mechanism to combine textual and visual features in a novel learning phase, chosen for its ability to reduce model size. To demonstrate our model's performance, we conduct experiments on the VizWiz VQA Challenge 2020. According to the experimental results, the proposed approach outperforms existing methods for Yes-No questions on the VizWiz VQA dataset.
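
The abstract names a stacked attention mechanism for combining textual and visual features but does not give its equations. The following is a minimal sketch of one common formulation of stacked attention, not the authors' implementation: a question vector attends over image region vectors, the attended image summary is added back to the question vector, and the step is repeated. All names (`attention_layer`, `stacked_attention`, `random_layer`), the toy dimensions, and the random weights are assumptions for illustration; in the setting described above the question vector would come from BERT and the region vectors from ResNet or VGG.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(regions, query, W_img, W_qry, w_score):
    """One attention hop: score each region against the query, then refine the query."""
    q_proj = matvec(W_qry, query)
    scores = []
    for v in regions:
        v_proj = matvec(W_img, v)
        hidden = [math.tanh(a + b) for a, b in zip(v_proj, q_proj)]
        scores.append(sum(w * h for w, h in zip(w_score, hidden)))
    weights = softmax(scores)
    # Attention-weighted sum of the region vectors.
    attended = [sum(p * v[j] for p, v in zip(weights, regions))
                for j in range(len(query))]
    # Residual-style update: the refined query feeds the next hop.
    refined = [a + q for a, q in zip(attended, query)]
    return refined, weights


def stacked_attention(regions, query, layers):
    """Apply several attention hops in sequence."""
    all_weights = []
    for W_img, W_qry, w_score in layers:
        query, weights = attention_layer(regions, query, W_img, W_qry, w_score)
        all_weights.append(weights)
    return query, all_weights


def random_layer(dim, hidden, rng):
    """Placeholder weights (assumption): a trained model would learn these."""
    W_img = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    W_qry = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w_score = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    return W_img, W_qry, w_score


# Toy stand-ins for image region features and a question embedding.
rng = random.Random(0)
dim, hidden, num_regions = 4, 3, 5
regions = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_regions)]
question = [rng.uniform(-1, 1) for _ in range(dim)]
layers = [random_layer(dim, hidden, rng) for _ in range(2)]

joint, attn = stacked_attention(regions, question, layers)
```

Here `joint` is the fused question-image vector that a classifier head could map to an answer, and `attn` holds one attention distribution over the image regions per hop.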
