Neurocomputing

Hierarchical attention-based multimodal fusion for video captioning

Abstract

Attention-based encoder-decoder models have achieved great success in video captioning. Recent multimodal video captioning work has mainly focused on applying the attention mechanism to all modalities and fusing them at the same level; however, the connections among specific modalities have not been investigated in the fusion process. In this paper, the expressive power of each individual modality is first investigated. Exploiting the characteristics of the attention mechanism, instance-level visual content is used to refine the temporal features. A semantic detection architecture based on a CNN+RNN is then applied to the spatiotemporal content to exploit the correlations between semantic labels, yielding a better video semantic representation. Finally, a hierarchical attention-based multimodal fusion model for video captioning is proposed that jointly considers the intrinsic properties of the multimodal features. Experimental results on the MSVD and MSR-VTT datasets show that the proposed method achieves competitive performance compared with related video captioning methods. (c) 2018 Elsevier B.V. All rights reserved.
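
For illustration, below is a minimal PyTorch sketch of the two-level fusion idea the abstract describes: attention is first applied within each modality's temporal features, and a second attention then weights the resulting per-modality context vectors given the decoder state. All class, parameter, and dimension names here are hypothetical; this is a sketch of the general technique, not a reproduction of the paper's exact architecture (the instance-level refinement and the CNN+RNN semantic branch are omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAttention(nn.Module):
    """Additive attention over one modality's temporal features."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, time, feat_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(self.w_feat(feats)
                                   + self.w_hidden(hidden).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)      # weights over time steps
        return (alpha * feats).sum(dim=1)     # (batch, feat_dim) context


class HierarchicalFusion(nn.Module):
    """Level 1: temporal attention within each modality.
    Level 2: attention over the per-modality context vectors."""

    def __init__(self, feat_dims, hidden_dim, attn_dim, fused_dim):
        super().__init__()
        self.temporal = nn.ModuleList(
            [TemporalAttention(d, hidden_dim, attn_dim) for d in feat_dims])
        self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in feat_dims])
        self.w_ctx = nn.Linear(fused_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, modality_feats, hidden):
        # modality_feats: list of (batch, time_m, feat_dim_m), one per modality
        ctx = torch.stack([proj(attn(f, hidden)) for f, attn, proj in
                           zip(modality_feats, self.temporal, self.proj)], dim=1)
        scores = self.v(torch.tanh(self.w_ctx(ctx)
                                   + self.w_hidden(hidden).unsqueeze(1)))
        beta = F.softmax(scores, dim=1)       # weights over modalities
        return (beta * ctx).sum(dim=1)        # fused context for the decoder


# Example usage with illustrative shapes: appearance + motion features.
fusion = HierarchicalFusion(feat_dims=[2048, 1024], hidden_dim=512,
                            attn_dim=256, fused_dim=512)
appearance = torch.randn(4, 20, 2048)         # e.g. 2D-CNN frame features
motion = torch.randn(4, 16, 1024)             # e.g. 3D-CNN clip features
h = torch.randn(4, 512)                       # current decoder hidden state
context = fusion([appearance, motion], h)     # (4, 512)
```

In a design like this, the second-level weights are recomputed at every decoding step, so the decoder can emphasize different modalities for different words rather than fusing all modalities with fixed, same-level weights.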
