Annual Meeting of the Association for Computational Linguistics

Multimodal Transformer for Unaligned Multimodal Language Sequences



Abstract

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, modeling such multimodal human language time-series data poses two major challenges: 1) inherent data non-alignment due to variable sampling rates of the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address both issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise cross-modal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the proposed cross-modal attention mechanism in MulT is able to capture correlated cross-modal signals.
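The directional pairwise cross-modal attention described in the abstract can be sketched with standard multi-head attention in which the queries come from a target modality and the keys and values come from a source modality, so the two streams may have different lengths and require no explicit time alignment. The following PyTorch sketch illustrates that reading only; it is not the authors' released MulT implementation, and the dimension sizes and toy usage at the bottom are assumptions for illustration.

import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Source -> target attention: queries come from the target modality,
    keys and values from the source modality, so the target stream is
    latently adapted by the source without explicit alignment."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_target, dim); source: (batch, T_source, dim).
        # The two sequence lengths may differ (unaligned modalities).
        q = self.norm_q(target)
        kv = self.norm_kv(source)
        out, _ = self.attn(q, kv, kv)
        return target + out  # residual connection back to the target stream


if __name__ == "__main__":
    # Toy example: a 20-step language stream attends to a 50-step audio stream.
    language = torch.randn(2, 20, 64)
    audio = torch.randn(2, 50, 64)
    block = CrossModalAttention(dim=64)
    print(block(language, audio).shape)  # torch.Size([2, 20, 64])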
