International Conference on Advanced Robotics and Mechatronics

Attention-based Visual-Audio Fusion for Video Caption Generation


Abstract

Recently, most work on generating a text description from a video has been based on an encoder-decoder framework. In the encoder stage, different convolutional neural networks are used to extract features from the audio and visual modalities respectively; the extracted features are then passed to the decoder stage, which uses an LSTM to generate the video caption. Current work is concerned with two questions. One is whether video captions are generated more accurately when different multimodal fusion strategies are adopted. The other is whether video captions are generated more accurately when an attention mechanism is added. In this paper, we propose a fusion framework that combines these two lines of work into a new model. In the encoder stage, two multimodal fusion strategies, weight sharing and memory sharing, are utilized respectively, allowing the two kinds of features to resonate and produce the final feature outputs. An LSTM with an attention mechanism is used in the decoder stage to generate the video description. Our fusion model combining the two methods is validated on the Microsoft Research Video to Text (MSR-VTT) dataset.
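
The sketch below illustrates the general approach described in the abstract: visual and audio features are fused through a shared projection layer, and an LSTM decoder attends over the fused features at each decoding step. It is a minimal illustrative example, not the authors' implementation; the module names (SharedWeightFusion, AttentionDecoder), feature dimensions, and the additive-style attention are assumptions for the sake of a runnable sketch.

```python
# Minimal PyTorch sketch (assumed structure, not the paper's code) of
# weight-sharing visual-audio fusion plus an LSTM decoder with attention.
import torch
import torch.nn as nn

class SharedWeightFusion(nn.Module):
    """Project visual and audio features, then pass both through a shared
    linear layer so the two modalities interact in a common space."""
    def __init__(self, vis_dim, aud_dim, hid_dim):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.aud_proj = nn.Linear(aud_dim, hid_dim)
        self.shared = nn.Linear(hid_dim, hid_dim)   # weights shared by both modalities

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (B, T, vis_dim), aud_feats: (B, T, aud_dim)
        v = torch.tanh(self.shared(self.vis_proj(vis_feats)))
        a = torch.tanh(self.shared(self.aud_proj(aud_feats)))
        return v + a                                # fused per-timestep features (B, T, hid_dim)

class AttentionDecoder(nn.Module):
    """LSTM decoder that attends over the fused video features at each step."""
    def __init__(self, vocab_size, hid_dim, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attn = nn.Linear(hid_dim * 2, 1)       # scores each encoder timestep against the decoder state
        self.lstm = nn.LSTMCell(emb_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, enc_feats, captions):
        # enc_feats: (B, T, H); captions: (B, L) token ids (teacher forcing)
        B, T, H = enc_feats.shape
        h = enc_feats.new_zeros(B, H)
        c = enc_feats.new_zeros(B, H)
        logits = []
        for t in range(captions.size(1)):
            query = h.unsqueeze(1).expand(-1, T, -1)                               # (B, T, H)
            scores = self.attn(torch.cat([enc_feats, query], dim=-1)).squeeze(-1)  # (B, T)
            alpha = torch.softmax(scores, dim=-1)
            context = torch.bmm(alpha.unsqueeze(1), enc_feats).squeeze(1)          # (B, H)
            x = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)           # (B, L, vocab_size)

# Example shapes: 2048-d visual and 128-d audio features over 20 video segments.
fusion = SharedWeightFusion(2048, 128, 512)
decoder = AttentionDecoder(vocab_size=10000, hid_dim=512)
vis = torch.randn(4, 20, 2048); aud = torch.randn(4, 20, 128)
caps = torch.randint(0, 10000, (4, 12))
logits = decoder(fusion(vis, aud), caps)            # (4, 12, 10000)
```

The memory-sharing variant mentioned in the abstract would replace the shared linear layer with a memory structure read and written by both modalities; it is omitted here to keep the sketch short.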
