Visual Grounding Strategies for Text-Only Natural Language Processing

Abstract

Visual grounding is a promising path toward more robust and accurate Natural Language Processing (NLP) models. Many multimodal extensions of BERT (e.g., VideoBERT, LXMERT, VL-BERT) allow joint modeling of texts and images, leading to state-of-the-art results on multimodal tasks such as Visual Question Answering. Here, we leverage multimodal modeling for purely textual tasks (language modeling and classification), with the expectation that multimodal pretraining provides a grounding that can improve text processing accuracy. We propose two types of strategies to this end. The first, referred to as transferred grounding, consists in applying multimodal models to text-only tasks, using a placeholder to replace the image input. The second, which we call associative grounding, harnesses image retrieval to match texts with related images during both pretraining and text-only downstream tasks. We draw further distinctions within both strategies and then compare them according to their impact on language modeling and commonsense-related downstream tasks, showing improvements over text-only baselines.
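To illustrate the first strategy, here is a minimal sketch of transferred grounding built on the LXMERT implementation in HuggingFace Transformers (not the authors' code): the multimodal model is run on text alone, with an all-zero tensor standing in for the image input. The zero-valued placeholder and the checkpoint name are assumptions for illustration; the abstract does not specify which placeholder is used.

import torch
from transformers import LxmertModel, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Text-only input for a purely textual task.
inputs = tokenizer("A dog chases a ball in the park.", return_tensors="pt")

# Placeholder visual input: LXMERT expects region features (2048-d) and
# bounding-box positions (4-d); a single all-zero region stands in for
# the missing image (an illustrative choice, not the paper's).
visual_feats = torch.zeros(1, 1, 2048)
visual_pos = torch.zeros(1, 1, 4)

outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
text_repr = outputs.language_output  # text-token embeddings from the multimodal encoder

Under the second strategy, associative grounding, the placeholder would instead be replaced by features of an image retrieved as a match for the input text.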