Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

Abstract

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextual multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K.
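
The abstract only outlines the approach; as a rough illustration of the recipe it describes (text and video encoders projected into a shared embedding space, trained with a contrastive objective, then queried with sentences in any language), a minimal sketch follows. It is not the paper's implementation: all module names, dimensions, pooling choices, and the in-batch InfoNCE objective are assumptions made for illustration.

# A minimal sketch (not the paper's implementation) of retrieval in a shared
# multilingual text-video embedding space; names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Toy stand-in for a multilingual Transformer text encoder."""
    def __init__(self, vocab_size=30000, dim=512, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(dim, embed_dim)

    def forward(self, token_ids):                       # (B, L) token ids
        h = self.encoder(self.embed(token_ids))         # contextual states (B, L, dim)
        pooled = h.mean(dim=1)                           # mean-pool over tokens
        return F.normalize(self.proj(pooled), dim=-1)    # unit-norm embedding (B, embed_dim)

class VideoEncoder(nn.Module):
    """Toy stand-in: projects pre-extracted frame features into the shared space."""
    def __init__(self, feat_dim=1024, embed_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                  nn.Linear(512, embed_dim))

    def forward(self, frame_feats):                      # (B, T, feat_dim) frame features
        pooled = frame_feats.mean(dim=1)                 # temporal mean-pooling
        return F.normalize(self.proj(pooled), dim=-1)

def contrastive_loss(text_emb, video_emb, temperature=0.05):
    """Symmetric InfoNCE over in-batch text-video pairs (a common pre-training objective)."""
    logits = text_emb @ video_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    text_enc, video_enc = TextEncoder(), VideoEncoder()
    queries = torch.randint(0, 30000, (4, 16))           # 4 tokenized sentences (any language)
    videos = torch.randn(4, 32, 1024)                    # 4 clips x 32 pre-extracted frame features
    t, v = text_enc(queries), video_enc(videos)
    print("similarity matrix for retrieval:", (t @ v.t()).shape)
    print("pre-training loss:", contrastive_loss(t, v).item())

Because the text and video embeddings live in one normalized space, retrieval reduces to ranking by cosine similarity, which is what lets a multilingually pre-trained text encoder serve non-English queries without any change on the video side.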