International Conference on Computer Vision

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

Abstract

Many vision and language models suffer from poor visual grounding -- often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. In this work, we propose a generic approach called Human Importance-aware Network Tuning (HINT) that effectively leverages human demonstrations to improve visual grounding. HINT encourages deep networks to be sensitive to the same input regions as humans. Our approach optimizes the alignment between human attention maps and gradient-based network importances -- ensuring that models learn not just to look at but rather rely on visual concepts that humans found relevant for a task when making predictions. We apply HINT to Visual Question Answering and Image Captioning tasks, outperforming top approaches on splits that penalize over-reliance on language priors (VQA-CP and robust captioning) using human attention demonstrations for just 6% of the training data.
