Rich Visual and Language Representation with Complementary Semantics for Video Captioning

Tang Pengjie; Wang Hanli; Li Qinyu


Abstract

It is interesting and challenging to translate a video to natural description sentences based on the video content. In this work, an advanced framework is built to generate sentences with coherence and rich semantic expressions for video captioning. A long short term memory (LSTM) network with an improved factored way is first developed, which takes inspiration from LSTM with a conventional factored way and from the common practice of feeding multi-modal features into LSTM at the first time step for visual description. Then, the incorporation of the LSTM network with the proposed improved factored way and un-factored way is exploited, and a voting strategy is utilized to predict candidate words. In addition, for robust and abstract visual and language representation, residuals are employed to enhance the gradient signals that are learned from the residual network (ResNet), and a deeper LSTM network is constructed. Furthermore, three convolutional neural network based features extracted from GoogLeNet, ResNet101, and ResNet152 are fused to catch more comprehensive and complementary visual information. Experiments are conducted on two benchmark datasets, MSVD and MSR-VTT2016, and competitive performances are obtained by the proposed techniques as compared to other state-of-the-art methods.
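
The abstract mentions two mechanisms that a small sketch may help make concrete: fusing features from several CNNs, and voting between decoders to pick the next word. The abstract does not specify how either is done, so everything below is an assumption for illustration only: fusion is shown as plain concatenation, voting as averaging the per-word probabilities of two decoders, and the function names `fuse_features` and `vote_next_word`, the toy vocabulary, and the scores are all hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(*feature_vectors):
    """Fuse per-network features by concatenation (assumed, not from the paper)."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def vote_next_word(score_lists, vocab):
    """Average the word distributions of several decoders and pick the top word.

    Averaging is one possible voting rule; the paper's rule may differ.
    """
    probs = [softmax(scores) for scores in score_lists]
    avg = [sum(p[i] for p in probs) / len(probs) for i in range(len(vocab))]
    best = max(range(len(vocab)), key=lambda i: avg[i])
    return vocab[best], avg


# Toy data: stand-ins for GoogLeNet / ResNet101 / ResNet152 features.
fused = fuse_features([0.1, 0.2], [0.3], [0.4, 0.5])

# Toy scores from a "factored" and an "un-factored" decoder at one time step.
vocab = ["a", "man", "dog", "runs"]
factored_scores = [0.1, 2.0, 1.5, 0.2]
unfactored_scores = [0.0, 1.0, 2.5, 0.1]
word, avg = vote_next_word([factored_scores, unfactored_scores], vocab)
```

In this toy case the two decoders disagree on the top word, and the averaged distribution decides between them; in a real captioner the same vote would be repeated at every time step of the sequence.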

Bibliographic details

Source
ACM Transactions on Multimedia Computing, Communications, and Applications | 2019, Issue 2 | 31.1-31.23 | 23 pages
Authors
Tang Pengjie; Wang Hanli; Li Qinyu
Affiliations

Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China|Jinggangshan Univ, Coll Math & Phys, Jian 343009, Jiangxi, Peoples R China;

Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China;

Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China|Lanzhou City Univ, Dept Comp Sci, Lanzhou 730070, Gansu, Peoples R China;

Original format: PDF
Language of text: English
Keywords
Video captioning; long short term memory; convolutional neural network; sequential voting; complementary features

Similar literature

1. Rich Visual and Language Representation with Complementary Semantics for Video Captioning [J]. Tang Pengjie, Wang Hanli, Li Qinyu. ACM transactions on multimedia computing communications and applications. 2019, Issue 2.
2. Translating video into language by enhancing visual and language representations [J]. Tang Pengjie, Tan Yunlan, Li Jinzhong. Journal of visual communication & image representation. 2020, Issue Octa.
3. Learning semantic sentence representations from visually grounded language without lexical knowledge [J]. Merkx Danny, Frank Stefan L. Natural language engineering. 2019, Issue PTa4.
4. Grounding language acquisition by training semantic parsers using captioned videos [C]. Candace Ross, Andrei Barbu, Yevgeni Berzak. Conference on empirical methods in natural language processing. 2018.
5. The effect of the use of videos captioning on English as a foreign language (EFL) on college students' language learning in Taiwan (China) [D]. Hwang, Yan-Ling. 2003.
6. A Comparison of Comprehension Processes in Sign Language Interpreter Videos with or without Captions [O]. Matjaž Debevc, Danijela Milošević, Ines Kožuh.
7. Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning [O]. Nayyer Aafaq, Naveed Akhtar, Wei Liu. 2019.
8. Rich Representations with Exposed Semantics for Deep Visual Reasoning [R]. Davis, L., Chellappa, R., Hoiem, D. 2016.
