Multimodal Deep Neural Network with Image Sequence Features for Video Captioning

Abstract

In this paper, we propose MDNNiSF (Multimodal Deep Neural Network with image Sequence Features) for generating a sentence description of a given video clip. A recently proposed model, S2VT, uses a stack of two LSTMs to solve this problem and has demonstrated a high METEOR score. However, experiments show that S2VT sometimes produces inaccurate sentences, which is unsurprising given how difficult it is to learn relationships between visual and textual content. A possible reason is that the available video caption data are still too small for this purpose. We try to circumvent this flaw by integrating S2VT with NeuralTalk2, an image captioning model known to generate accurate descriptions owing to its capability of learning alignments between text fragments and image fragments. Experiments on two video caption datasets, MSVD and MSRVTT, demonstrate the effectiveness of MDNNiSF over S2VT. For example, on MSVD, MDNNiSF achieved a METEOR score of 0.344, which is 21.5% higher than that of S2VT.
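The abstract reports the MDNNiSF score and the percentage gain but not the S2VT score itself. As a rough sanity check, the sketch below back-computes the baseline that those two numbers would imply. It assumes the 21.5% figure is a relative improvement (an interpretation, not something the abstract states), and the helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(score: float, relative_gain_pct: float) -> float:
    """Return the baseline implied by `score` being `relative_gain_pct` percent higher."""
    return score / (1.0 + relative_gain_pct / 100.0)


mdnnisf_meteor = 0.344  # reported in the abstract (MSVD)
gain_pct = 21.5         # reported in the abstract; ASSUMED here to be a relative gain

# Derived value, not a figure reported in the abstract.
s2vt_meteor = implied_baseline(mdnnisf_meteor, gain_pct)

print(f"Implied S2VT METEOR on MSVD: {s2vt_meteor:.3f}")
print(f"Absolute difference: {mdnnisf_meteor - s2vt_meteor:.3f}")
```

Under that assumption the implied S2VT METEOR on MSVD is about 0.283. This is a derived estimate only; the paper itself should be consulted for the measured baseline.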

Bibliographic details

Source
International Joint Conference on Neural Networks, 2018, pp. 1-7 (7 pages)
Authors
Soichiro Oura; Tetsu Matsukawa; Einoshin Suzuki
Original format: PDF
Keywords
Image sequences; Recurrent neural networks; Encoding; Logic gates; Decoding; Visualization;
