
Video Captioning With Attention-Based LSTM and Semantic Consistency



Abstract

Recent progress in using long short-term memory (LSTM) for image captioning has motivated the exploration of their applications for video captioning. By taking a video as a sequence of features, an LSTM model is trained on video-sentence pairs and learns to associate a video with a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without considering an attention mechanism that allows for selecting salient features. Furthermore, existing approaches usually model the translation error but ignore the correlations between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, to translate videos into natural sentences. This framework integrates an attention mechanism with LSTM to capture the salient structures of a video, and explores the correlation between multimodal representations (i.e., words and visual content) for generating sentences with rich semantic content. Specifically, we first propose an attention mechanism that uses a dynamic weighted sum of local two-dimensional convolutional neural network representations. Then, an LSTM decoder takes these visual features at time t and the word-embedding feature at time t−1 to generate important words. Finally, we use multimodal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistency of the sentence description and the video visual content. Experiments on the benchmark datasets demonstrate that our method using a single feature can achieve competitive or even better results than the state-of-the-art baselines for video captioning in both BLEU and METEOR.
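To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' released code): it combines a soft-attention weighted sum over per-frame 2D-CNN features, an LSTM decoder that takes the attended visual feature at time t together with the word embedding from time t−1, and a joint visual-sentence embedding standing in for the semantic-consistency term. All layer sizes, names, and the cosine-based consistency loss are illustrative assumptions.

```python
# Hypothetical sketch of attention-based LSTM captioning with a semantic-consistency
# term; dimensions, names, and loss forms are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512, joint_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # Attention scores depend on the previous hidden state and each frame feature.
        self.att_frame = nn.Linear(feat_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # Decoder LSTM consumes [attended visual feature ; previous word embedding].
        self.lstm = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)
        # Projections into a joint space for the semantic-consistency term.
        self.vis_proj = nn.Linear(feat_dim, joint_dim)
        self.sent_proj = nn.Linear(hidden_dim, joint_dim)

    def attend(self, frame_feats, h):
        # frame_feats: (B, T, feat_dim); h: (B, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_frame(frame_feats) + self.att_hidden(h).unsqueeze(1)))  # (B, T, 1)
        alpha = F.softmax(scores, dim=1)
        # Dynamic weighted sum of the local CNN representations.
        return (alpha * frame_feats).sum(dim=1)  # (B, feat_dim)

    def forward(self, frame_feats, captions):
        # captions: (B, L) word indices, used with teacher forcing for simplicity.
        B, L = captions.shape
        h = frame_feats.new_zeros(B, self.lstm.hidden_size)
        c = frame_feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(1, L):
            ctx = self.attend(frame_feats, h)           # attended visual feature at time t
            prev = self.word_embed(captions[:, t - 1])  # word embedding from time t-1
            h, c = self.lstm(torch.cat([ctx, prev], dim=1), (h, c))
            logits.append(self.out(h))
        logits = torch.stack(logits, dim=1)             # (B, L-1, vocab_size)

        # Semantic-consistency term: pull the mean visual feature and the final sentence
        # state together in a joint embedding space (a plain cosine loss is used here as
        # a stand-in for whatever embedding loss the paper actually optimizes).
        v = F.normalize(self.vis_proj(frame_feats.mean(dim=1)), dim=1)
        s = F.normalize(self.sent_proj(h), dim=1)
        consistency_loss = (1.0 - (v * s).sum(dim=1)).mean()

        caption_loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
        return caption_loss + consistency_loss
```

In use, frame_feats would hold the stacked 2D-CNN features for the sampled frames of one video and captions the tokenized ground-truth sentence; the two losses are simply summed here, whereas the actual framework may weight the captioning and consistency objectives differently.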

Bibliographic Information

  • Source
    IEEE transactions on multimedia | 2017, No. 9 | pp. 2045-2055 | 11 pages
  • Author Affiliations

    Center of Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China;

    Center of Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China;

    Department of Computer Science, Columbia University, New York, NY, USA;

    Center of Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China;

    Center of Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China;

  • Indexing Information
  • Original Format: PDF
  • Language: eng
  • Chinese Library Classification (CLC)
  • Keywords

    Visualization; Semantics; Two dimensional displays; Neural networks; Computational modeling; Feature extraction; Correlation;


