Source: JMLR: Workshop and Conference Proceedings

TVT: Two-View Transformer Network for Video Captioning



Abstract

Video captioning is the task of automatically generating a natural-language description of a given video. Under an encoder-decoder framework, video captioning poses two main challenges: 1) how to model sequential information; 2) how to combine the video and text modalities. For challenge 1), recurrent neural network (RNN)-based methods are currently the most common approach for learning temporal representations of videos, but they suffer from high computational cost. For challenge 2), the features of different modalities are often simply concatenated, with little analysis of how they should be combined. In this paper, we introduce a novel video captioning framework, the Two-View Transformer (TVT). TVT comprises a Transformer-network backbone for sequential representation and two types of fusion blocks in the decoder layers for effectively combining different modalities. Empirical study shows that our TVT model outperforms state-of-the-art methods on the MSVD dataset and achieves competitive performance on the MSR-VTT dataset under four common metrics.
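The two ways of combining modalities contrasted in the abstract (plain concatenation versus dedicated fusion in the decoder) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the `attend` helper, the feature dimensions, and the random "appearance"/"motion" features are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame features from two views of a video:
# appearance (e.g., 2D-CNN frame features) and motion (e.g., 3D-CNN features).
T, d = 8, 16                              # frames, model dimension (illustrative)
appearance = rng.standard_normal((T, d))
motion = rng.standard_normal((T, d))
text_query = rng.standard_normal((1, d))  # a decoder-side query vector

def attend(query, memory):
    """Scaled dot-product attention of a query over a memory sequence."""
    scores = query @ memory.T / np.sqrt(memory.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ memory

# Fusion style 1: attend to each view separately, then sum the contexts.
fused_add = attend(text_query, appearance) + attend(text_query, motion)

# Fusion style 2: concatenate the two contexts and project back to d dims.
W = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
fused_cat = np.concatenate(
    [attend(text_query, appearance), attend(text_query, motion)], axis=1
) @ W

print(fused_add.shape, fused_cat.shape)  # both (1, 16)
```

Either style yields a context vector of the model dimension; the design question the paper raises is where in the decoder such fusion should happen and which variant works best.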


