
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Abstract

Existing pre-training methods either focus on single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or image) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a UNIfied-MOdal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections are utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space, over a corpus of image-text pairs augmented with related images and texts. With the help of rich non-paired single-modal data, our model is able to learn more generalizable representations by allowing textual knowledge and visual knowledge to enhance each other in the unified semantic space. The experimental results show that UNIMO greatly improves the performance of several single-modal and multi-modal downstream tasks.
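For intuition, the cross-modal contrastive objective described above can be sketched as a symmetric InfoNCE-style loss over a batch of aligned image-text embeddings. The sketch below is illustrative only, not the paper's implementation: the function name, the temperature value, and the random stand-in encoder outputs are assumptions, and UNIMO's retrieval- and rewriting-based augmentation of positives and negatives is omitted.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb: torch.Tensor,
                                 image_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of image-text pairs.

    text_emb, image_emb: (batch, dim) outputs of placeholder text and
    image encoders; row i of each tensor is assumed to be a matched pair.
    """
    # Normalize so dot products are cosine similarities in a shared
    # (unified) semantic space.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # logits[i, j] = similarity between text i and image j.
    logits = text_emb @ image_emb.t() / temperature

    # Diagonal entries are the positive (paired) examples;
    # off-diagonal entries act as in-batch negatives.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: text-to-image and image-to-text.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2i + loss_i2t)

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    text = torch.randn(8, 256)
    image = torch.randn(8, 256)
    print(cross_modal_contrastive_loss(text, image).item())
```

Because both modalities are projected into the same space before the loss is computed, representations learned this way can, as the abstract claims, also serve single-modal tasks drawing on the non-paired text and image data.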