Source: JMLR: Workshop and Conference Proceedings

TVT: Two-View Transformer Network for Video Captioning



Abstract

Video captioning is the task of automatically generating a natural-language description of a given video. Under an encoder-decoder framework, video captioning poses two main challenges: 1) how to model sequential information; 2) how to combine the video and text modalities. For challenge 1), recurrent neural network (RNN)-based methods are currently the most common approach for learning temporal representations of videos, but they suffer from high computational cost. For challenge 2), the features of different modalities are often simply concatenated, with little analysis of how they should be combined. In this paper, we introduce a novel video captioning framework, the Two-View Transformer (TVT). TVT comprises a Transformer-network backbone for sequential representation and two types of fusion blocks in the decoder layers for effectively combining different modalities. Empirical study shows that our TVT model outperforms state-of-the-art methods on the MSVD dataset and achieves competitive performance on the MSR-VTT dataset under four common metrics.
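The two ways of combining modalities contrasted in the abstract (plain concatenation versus dedicated fusion in the decoder) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the `attend` helper, the feature dimensions, and the random "appearance"/"motion" features are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame features from two views of a video:
# appearance (e.g., 2D-CNN frame features) and motion (e.g., 3D-CNN features).
T, d = 8, 16                              # frames, model dimension (illustrative)
appearance = rng.standard_normal((T, d))
motion = rng.standard_normal((T, d))
text_query = rng.standard_normal((1, d))  # a decoder-side query vector

def attend(query, memory):
    """Scaled dot-product attention of a query over a memory sequence."""
    scores = query @ memory.T / np.sqrt(memory.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ memory

# Fusion style 1: attend to each view separately, then sum the contexts.
fused_add = attend(text_query, appearance) + attend(text_query, motion)

# Fusion style 2: concatenate the two contexts and project back to d dims.
W = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
fused_cat = np.concatenate(
    [attend(text_query, appearance), attend(text_query, motion)], axis=1
) @ W

print(fused_add.shape, fused_cat.shape)  # both (1, 16)
```

Either style yields a context vector of the model dimension; the design question the paper raises is where in the decoder such fusion should happen and which variant works best.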


