International Conference on Advanced Robotics and Mechatronics

Attention-based Visual-Audio Fusion for Video Caption Generation


Abstract

Recently, most work on generating a text description from a video has been based on an encoder-decoder framework. In the encoder stage, different convolutional neural networks are used to extract features from the audio and visual modalities respectively; the extracted features are then passed to the decoder stage, which uses an LSTM to generate the video caption. Current work is concerned with two questions. One is whether video captions are generated more accurately when different multimodal fusion strategies are adopted. The other is whether video captions are generated more accurately when an attention mechanism is added. In this paper, we propose a fusion framework that combines these two lines of work into a new model. In the encoder stage, two multimodal fusion strategies, weight sharing and memory sharing, are utilized respectively, allowing the two kinds of features to resonate and produce the final feature outputs. An LSTM with an attention mechanism is used in the decoder stage to generate the video description. Our fusion model combining the two methods is validated on the Microsoft Research Video to Text (MSR-VTT) dataset.
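
The sketch below illustrates the general approach described in the abstract: visual and audio features are fused through a shared projection layer, and an LSTM decoder attends over the fused features at each decoding step. It is a minimal illustrative example, not the authors' implementation; the module names (SharedWeightFusion, AttentionDecoder), feature dimensions, and the additive-style attention are assumptions for the sake of a runnable sketch.

```python
# Minimal PyTorch sketch (assumed structure, not the paper's code) of
# weight-sharing visual-audio fusion plus an LSTM decoder with attention.
import torch
import torch.nn as nn

class SharedWeightFusion(nn.Module):
    """Project visual and audio features, then pass both through a shared
    linear layer so the two modalities interact in a common space."""
    def __init__(self, vis_dim, aud_dim, hid_dim):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.aud_proj = nn.Linear(aud_dim, hid_dim)
        self.shared = nn.Linear(hid_dim, hid_dim)   # weights shared by both modalities

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (B, T, vis_dim), aud_feats: (B, T, aud_dim)
        v = torch.tanh(self.shared(self.vis_proj(vis_feats)))
        a = torch.tanh(self.shared(self.aud_proj(aud_feats)))
        return v + a                                # fused per-timestep features (B, T, hid_dim)

class AttentionDecoder(nn.Module):
    """LSTM decoder that attends over the fused video features at each step."""
    def __init__(self, vocab_size, hid_dim, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attn = nn.Linear(hid_dim * 2, 1)       # scores each encoder timestep against the decoder state
        self.lstm = nn.LSTMCell(emb_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, enc_feats, captions):
        # enc_feats: (B, T, H); captions: (B, L) token ids (teacher forcing)
        B, T, H = enc_feats.shape
        h = enc_feats.new_zeros(B, H)
        c = enc_feats.new_zeros(B, H)
        logits = []
        for t in range(captions.size(1)):
            query = h.unsqueeze(1).expand(-1, T, -1)                               # (B, T, H)
            scores = self.attn(torch.cat([enc_feats, query], dim=-1)).squeeze(-1)  # (B, T)
            alpha = torch.softmax(scores, dim=-1)
            context = torch.bmm(alpha.unsqueeze(1), enc_feats).squeeze(1)          # (B, H)
            x = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)           # (B, L, vocab_size)

# Example shapes: 2048-d visual and 128-d audio features over 20 video segments.
fusion = SharedWeightFusion(2048, 128, 512)
decoder = AttentionDecoder(vocab_size=10000, hid_dim=512)
vis = torch.randn(4, 20, 2048); aud = torch.randn(4, 20, 128)
caps = torch.randint(0, 10000, (4, 12))
logits = decoder(fusion(vis, aud), caps)            # (4, 12, 10000)
```

The memory-sharing variant mentioned in the abstract would replace the shared linear layer with a memory structure read and written by both modalities; it is omitted here to keep the sketch short.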
