Annual Meeting of the Association for Computational Linguistics

Multimodal Transformer for Unaligned Multimodal Language Sequences



Abstract

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, modeling such multimodal human language time-series data poses two major challenges: 1) inherent data non-alignment due to variable sampling rates of the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address both issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise cross-modal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the proposed cross-modal attention mechanism in MulT is able to capture correlated cross-modal signals.
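The directional pairwise cross-modal attention described in the abstract can be sketched with standard multi-head attention in which the queries come from a target modality and the keys and values come from a source modality, so the two streams may have different lengths and require no explicit time alignment. The following PyTorch sketch illustrates that reading only; it is not the authors' released MulT implementation, and the dimension sizes and toy usage at the bottom are assumptions for illustration.

import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Source -> target attention: queries come from the target modality,
    keys and values from the source modality, so the target stream is
    latently adapted by the source without explicit alignment."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_target, dim); source: (batch, T_source, dim).
        # The two sequence lengths may differ (unaligned modalities).
        q = self.norm_q(target)
        kv = self.norm_kv(source)
        out, _ = self.attn(q, kv, kv)
        return target + out  # residual connection back to the target stream


if __name__ == "__main__":
    # Toy example: a 20-step language stream attends to a 50-step audio stream.
    language = torch.randn(2, 20, 64)
    audio = torch.randn(2, 50, 64)
    block = CrossModalAttention(dim=64)
    print(block(language, audio).shape)  # torch.Size([2, 20, 64])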
