Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)

Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning



Abstract

A major challenge for video captioning is to combine audio and visual cues. Existing multi-modal fusion methods have shown encouraging results in video understanding. However, the temporal structures of multiple modalities at different granularities are rarely explored, and how to selectively fuse the multi-modal representations at different levels of details remains uncharted. In this paper, we propose a novel hierarchically aligned cross-modal attention (HACA) framework to learn and selectively fuse both global and local temporal dynamics of different modalities. Furthermore, for the first time, we validate the superior performance of the deep audio features on the video captioning task. Finally, our HACA model significantly outperforms the previous best systems and achieves new state-of-the-art results on the widely used MSR-VTT dataset.
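To make the fusion idea concrete, below is a minimal PyTorch sketch of one decoder step that attends over visual and audio features at both a global (coarse, segment-level) and a local (fine, frame-level) temporal scale, then fuses the resulting context vectors before predicting the next word. All module names, feature dimensions, and the specific fusion scheme are illustrative assumptions based only on the abstract, not the paper's actual HACA implementation.

```python
# Illustrative sketch of globally and locally aligned cross-modal attention fusion.
# Assumed for illustration: additive (Bahdanau-style) attention, a GRU decoder cell,
# and concatenation-based fusion of the four modality/scale context vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """Scores each encoder time step against the decoder state and returns a context vector."""
    def __init__(self, query_dim, key_dim, attn_dim):
        super().__init__()
        self.q_proj = nn.Linear(query_dim, attn_dim)
        self.k_proj = nn.Linear(key_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, query, keys):
        # query: (batch, query_dim); keys: (batch, steps, key_dim)
        energy = torch.tanh(self.q_proj(query).unsqueeze(1) + self.k_proj(keys))
        weights = F.softmax(self.score(energy).squeeze(-1), dim=-1)   # (batch, steps)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)    # (batch, key_dim)
        return context, weights


class CrossModalFusionDecoderStep(nn.Module):
    """One decoding step that attends to visual and audio features at global and
    local temporal scales and fuses the four contexts before emitting a word."""
    def __init__(self, vis_dim=2048, aud_dim=128, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.attn_vis_local = AdditiveAttention(hidden_dim, vis_dim, hidden_dim)
        self.attn_vis_global = AdditiveAttention(hidden_dim, vis_dim, hidden_dim)
        self.attn_aud_local = AdditiveAttention(hidden_dim, aud_dim, hidden_dim)
        self.attn_aud_global = AdditiveAttention(hidden_dim, aud_dim, hidden_dim)
        self.fuse = nn.Linear(2 * vis_dim + 2 * aud_dim, hidden_dim)
        self.cell = nn.GRUCell(hidden_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, hidden, vis_local, vis_global, aud_local, aud_global):
        # word_emb: (batch, hidden_dim) embedding of the previously generated word
        c_vl, _ = self.attn_vis_local(hidden, vis_local)
        c_vg, _ = self.attn_vis_global(hidden, vis_global)
        c_al, _ = self.attn_aud_local(hidden, aud_local)
        c_ag, _ = self.attn_aud_global(hidden, aud_global)
        fused = torch.tanh(self.fuse(torch.cat([c_vl, c_vg, c_al, c_ag], dim=-1)))
        hidden = self.cell(torch.cat([word_emb, fused], dim=-1), hidden)
        return self.out(hidden), hidden
```

In this sketch the local streams would typically carry many more time steps (e.g. frame-level features) than the global streams (e.g. segment-level summaries), and the per-step attention weights let the decoder choose which modality and which temporal scale to rely on for each generated word; the actual HACA model additionally aligns the attentions hierarchically across modalities, which this simplified example does not reproduce.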


