AAAI Conference on Artificial Intelligence

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training



Abstract

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained models such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), both visual and linguistic contents are fed into a multi-layer Transformer (Vaswani et al. 2017) for cross-modal pre-training, where three pre-training tasks are employed: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based jointly on linguistic and visual contents. The last task tries to predict whether an image and a text describe each other. After pre-training on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer. We achieve state-of-the-art or comparable results on both tasks, showing the powerful ability of cross-modal pre-training.
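To make the described setup concrete, below is a minimal PyTorch sketch of a Unicoder-VL-style encoder: text tokens and detected image regions are projected into one shared space, encoded by a single multi-layer Transformer, and read out by three heads corresponding to the MLM, MOC, and VLM pre-training tasks. The class name, hyperparameters (hidden size, layer count, region-feature dimension, label vocabularies), and the use of a stock nn.TransformerEncoder are illustrative assumptions, not the authors' exact implementation; position/segment embeddings and masking logic are omitted for brevity.

import torch
import torch.nn as nn

class UnicoderVLSketch(nn.Module):
    """Minimal sketch of a Unicoder-VL-style cross-modal encoder.
    Sizes below (BERT-like vocab, 1601 object classes, 2048-d detector
    features) are assumptions for illustration only."""
    def __init__(self, vocab_size=30522, num_obj_classes=1601,
                 region_feat_dim=2048, d_model=768, nhead=12, num_layers=12):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Project detector region features into the same d_model space
        # so both modalities can share one Transformer.
        self.region_proj = nn.Linear(region_feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # The three pre-training heads named in the abstract.
        self.mlm_head = nn.Linear(d_model, vocab_size)       # Masked Language Modeling
        self.moc_head = nn.Linear(d_model, num_obj_classes)  # Masked Object Classification
        self.vlm_head = nn.Linear(d_model, 2)                # Visual-linguistic Matching

    def forward(self, token_ids, region_feats):
        text = self.token_emb(token_ids)          # (B, T, d_model)
        vision = self.region_proj(region_feats)   # (B, R, d_model)
        # Joint encoding over the concatenated text + region sequence.
        h = self.encoder(torch.cat([text, vision], dim=1))
        T = token_ids.size(1)
        return {
            "mlm_logits": self.mlm_head(h[:, :T]),  # per text token
            "moc_logits": self.moc_head(h[:, T:]),  # per image region
            "vlm_logits": self.vlm_head(h[:, 0]),   # image-text match, pooled at first token
        }

model = UnicoderVLSketch()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 10, 2048))
print({k: v.shape for k, v in out.items()})

Because all three heads sit on one shared encoder, this layout also matches the transfer recipe in the abstract: adapting to caption-based image-text retrieval or visual commonsense reasoning amounts to attaching a single task-specific output layer on top of the same encoder.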
