Annual Meeting of the Association for Computational Linguistics

Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation


Abstract

Vision-and-Language Navigation (VLN) requires grounding instructions, such as "turn right" and "stop at the door", to routes in a visual environment. The actual grounding can connect language to the environment through multiple modalities, e.g. "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route. We investigate where the natural language empirically grounds under two recent state-of-the-art VLN models. Surprisingly, we discover that visual features may actually hurt these models: models which only use route structure, ablating visual features, outperform their visual counterparts in unseen new environments on the benchmark Room-to-Room dataset. To better use all the available modalities, we propose to decompose the grounding procedure into a set of expert models with access to different modalities (including object detections) and ensemble them at prediction time, improving the performance of state-of-the-art models on the VLN task.
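The modality-decomposition idea lends itself to a simple prediction-time ensemble. Below is a minimal sketch in Python, assuming each expert is a callable that maps the current observation to action logits; the function name and the uniform-averaging combination rule are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def ensemble_action_probs(experts, observation):
    """Average the action distributions of per-modality expert policies.

    `experts` is a list of models, each grounding the instruction through a
    different modality (e.g. route structure only, visual appearance, or
    object detections). The uniform average here is a hypothetical
    combination rule for illustration.
    """
    probs = [torch.softmax(expert(observation), dim=-1) for expert in experts]
    return torch.stack(probs).mean(dim=0)

# At each navigation step the agent could act greedily on the mixture:
# action = ensemble_action_probs(experts, obs).argmax(dim=-1)
```

Averaging probabilities (rather than logits) keeps each expert's contribution on a common scale, so an expert that is confidently wrong in one modality can be outvoted by the others.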
