Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)

Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning



Abstract

A major challenge for video captioning is to combine audio and visual cues. Existing multi-modal fusion methods have shown encouraging results in video understanding. However, the temporal structures of multiple modalities at different granularities are rarely explored, and how to selectively fuse the multi-modal representations at different levels of details remains uncharted. In this paper, we propose a novel hierarchically aligned cross-modal attention (HACA) framework to learn and selectively fuse both global and local temporal dynamics of different modalities. Furthermore, for the first time, we validate the superior performance of the deep audio features on the video captioning task. Finally, our HACA model significantly outperforms the previous best systems and achieves new state-of-the-art results on the widely used MSR-VTT dataset.
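To make the fusion idea concrete, below is a minimal PyTorch sketch of one decoder step that attends over visual and audio features at both a global (coarse, segment-level) and a local (fine, frame-level) temporal scale, then fuses the resulting context vectors before predicting the next word. All module names, feature dimensions, and the specific fusion scheme are illustrative assumptions based only on the abstract, not the paper's actual HACA implementation.

```python
# Illustrative sketch of globally and locally aligned cross-modal attention fusion.
# Assumed for illustration: additive (Bahdanau-style) attention, a GRU decoder cell,
# and concatenation-based fusion of the four modality/scale context vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """Scores each encoder time step against the decoder state and returns a context vector."""
    def __init__(self, query_dim, key_dim, attn_dim):
        super().__init__()
        self.q_proj = nn.Linear(query_dim, attn_dim)
        self.k_proj = nn.Linear(key_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, query, keys):
        # query: (batch, query_dim); keys: (batch, steps, key_dim)
        energy = torch.tanh(self.q_proj(query).unsqueeze(1) + self.k_proj(keys))
        weights = F.softmax(self.score(energy).squeeze(-1), dim=-1)   # (batch, steps)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)    # (batch, key_dim)
        return context, weights


class CrossModalFusionDecoderStep(nn.Module):
    """One decoding step that attends to visual and audio features at global and
    local temporal scales and fuses the four contexts before emitting a word."""
    def __init__(self, vis_dim=2048, aud_dim=128, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.attn_vis_local = AdditiveAttention(hidden_dim, vis_dim, hidden_dim)
        self.attn_vis_global = AdditiveAttention(hidden_dim, vis_dim, hidden_dim)
        self.attn_aud_local = AdditiveAttention(hidden_dim, aud_dim, hidden_dim)
        self.attn_aud_global = AdditiveAttention(hidden_dim, aud_dim, hidden_dim)
        self.fuse = nn.Linear(2 * vis_dim + 2 * aud_dim, hidden_dim)
        self.cell = nn.GRUCell(hidden_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, hidden, vis_local, vis_global, aud_local, aud_global):
        # word_emb: (batch, hidden_dim) embedding of the previously generated word
        c_vl, _ = self.attn_vis_local(hidden, vis_local)
        c_vg, _ = self.attn_vis_global(hidden, vis_global)
        c_al, _ = self.attn_aud_local(hidden, aud_local)
        c_ag, _ = self.attn_aud_global(hidden, aud_global)
        fused = torch.tanh(self.fuse(torch.cat([c_vl, c_vg, c_al, c_ag], dim=-1)))
        hidden = self.cell(torch.cat([word_emb, fused], dim=-1), hidden)
        return self.out(hidden), hidden
```

In this sketch the local streams would typically carry many more time steps (e.g. frame-level features) than the global streams (e.g. segment-level summaries), and the per-step attention weights let the decoder choose which modality and which temporal scale to rely on for each generated word; the actual HACA model additionally aligns the attentions hierarchically across modalities, which this simplified example does not reproduce.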


