Natural Language Description for Videos Using NetVLAD and Attentional LSTM

Abstract

Video captioning is the process of generating a textual description of a video that covers the objects and actions present in it. Multimodal information is available in each frame, based on texture and time in the video. The central challenge in video captioning is automatically generating a caption that precisely reflects the video content. Drawing on advances in deep learning, we propose a model that generates natural-language descriptions of the activities in a video. In the proposed work, the first stage extracts key, machine-understandable features of the video content using 2D and 3D convolutional neural networks (CNNs), which capture the spatial and temporal features respectively. The extracted features are preprocessed using NetVLAD. After NetVLAD preprocessing, the features are concatenated and fed into an attention-based Long Short-Term Memory network (aLSTM). The aLSTM generates sentences sequentially by selecting the salient features. The expected output of the model is a sentence that describes the contents of the video. The model is evaluated using the Bilingual Evaluation Understudy (BLEU) metric.
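The pooling-and-concatenation step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no layer sizes or training details, so the function names (`netvlad_pool`, `l2_normalize`), the toy descriptors, the fixed cluster centers, and the distance-based soft assignment are all assumptions made for illustration. A trained NetVLAD layer learns its assignment weights and centers rather than fixing them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def netvlad_pool(descriptors, centers, alpha=1.0):
    """Aggregate N descriptors (length D each) into one vector of length K*D.

    Assumption: soft-assignment scores are -alpha * squared distance to each
    center; a trained NetVLAD layer learns these scores instead.
    """
    k, d = len(centers), len(centers[0])
    vlad = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        scores = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                  for c in centers]
        assign = softmax(scores)
        # Accumulate assignment-weighted residuals to each cluster center.
        for j, c in enumerate(centers):
            for i in range(d):
                vlad[j][i] += assign[j] * (x[i] - c[i])
    # Intra-normalize each cluster row, flatten, then L2-normalize the result.
    flat = [v for row in vlad for v in l2_normalize(row)]
    return l2_normalize(flat)

# Hypothetical toy inputs standing in for 2D-CNN and 3D-CNN outputs.
spatial = [[0.9, 0.1], [1.1, -0.1], [0.0, 1.0]]
temporal = [[0.2, 0.8], [0.1, 0.9]]
centers = [[1.0, 0.0], [0.0, 1.0]]

# Concatenate the two pooled vectors, as the abstract describes.
video_feature = netvlad_pool(spatial, centers) + netvlad_pool(temporal, centers)
```

Each pooled vector has a fixed length of K*D regardless of how many frames or clips were extracted, which is what allows the spatial and temporal streams to be concatenated before they reach the caption decoder.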

Bibliographic details

Source
International Conference for Emerging Technology, 2020, pp. 1-6 (6 pages)
Authors
Jeevitha V K; Hemalatha M
Original format: PDF
Keywords
Videos; Feature extraction; Two dimensional displays; Natural languages; Semantics; Task analysis; Machine learning

Similar literature

1. Script identification in natural scene image and video frames using an attention based Convolutional-LSTM network [J]. Bhunia Ankan Kumar, Konwer Aishik, Bhunia Ayan Kumar. Pattern Recognition: The Journal of the Pattern Recognition Society. 2019
2. Exposing DeepFake Videos Using Attention Based Convolutional LSTM Network [J]. Su Yishan, Xia Huawei, Liang Qi. Neural Processing Letters. 2021, no. 6
3. Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos [J]. IEEE Transactions on Multimedia. 2020, no. 3
4. Generating Short Video Description using Deep-LSTM and Attention Mechanism [C]. Naveen Yadav, Dinesh Naik. International Conference for Convergence in Technology. 2021
5. An Application of Natural Language Processing: Named Entity Recognition with BLSTM in Chinese Corpora [D]. Mao, Lihui. 2019
6. Language Production Strategies and Disfluencies in Multi-Clause Network Descriptions: A Study of Adult Attention-Deficit/Hyperactivity Disorder [O]. Paul E. Engelhardt, Fernanda Ferreira, Joel T. Nigg
7. Script Identification in Natural Scene Image and Video Frame using Attention based Convolutional-LSTM Network [O]. Bhunia, Ankan Kumar, Konwer, Aishik, Bhowmick, Abir. 2018
8. Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild. [R]. Thomason, J., Venugopalan, S., Guadarrama, S. 2014
