Source: Pattern Recognition Letters

Sequence in sequence for video captioning



Abstract

In video captioning, the words of a caption depend closely on an overall understanding of the video, so a suitable video representation is important for generating the description. To produce more precise words, we encode the video features for the current word at each time step of the generation process. This paper proposes a new framework, 'Sequence in Sequence', that encodes the sequential frames into a spatio-temporal representation at each time step to emit a word, and further distills the most relevant visual content through an extra semantic loss. First, we aggregate the sequential frames, guided by the previous word, to extract the relevant visual content and obtain a representation rich in spatio-temporal information. Then, to decode the aggregated representation into a precise word, we use a two-layer GRU structure: the first layer further distills useful visual content based on the extra semantic loss, and the second layer selects the word according to the distilled features. Experiments on two benchmark datasets demonstrate that our method outperforms the current state-of-the-art methods on the BLEU@4, METEOR and CIDEr metrics. (C) 2018 Elsevier B.V. All rights reserved.
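The two stages described in the abstract can be illustrated with a rough sketch. The abstract gives no equations, so everything below is an assumption made for illustration: the frame aggregation is written as ordinary additive soft attention conditioned on the previous word's embedding, the two decoder layers are standard GRU cells, and all names (`aggregate_frames`, `gru_step`, `decode_step`, the weight matrices and dimensions) are hypothetical. The extra semantic loss on the first layer is not defined in the abstract and is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def aggregate_frames(frames, prev_word, W_f, W_w, v):
    """Assumed soft attention: weight each frame by its relevance to the previous word.

    frames: (T, d_f) per-frame features; prev_word: (d_w,) word embedding.
    Returns the weighted frame average (d_f,) and the attention weights (T,).
    """
    scores = np.tanh(frames @ W_f + prev_word @ W_w) @ v  # (T,)
    weights = softmax(scores)
    return weights @ frames, weights

def init_gru(rng, d_in, d_h):
    """Random parameters for one standard GRU cell (biases omitted for brevity)."""
    p = {k: rng.normal(scale=0.1, size=(d_in, d_h)) for k in ("Wz", "Wr", "Wh")}
    p.update({k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in ("Uz", "Ur", "Uh")})
    return p

def gru_step(x, h, p):
    """One step of a standard GRU cell."""
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"])               # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"])               # reset gate
    h_new = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"])     # candidate state
    return (1.0 - z) * h + z * h_new

def decode_step(frames, prev_word, h1, h2, att, gru1, gru2, W_out):
    """One word-generation step: aggregate frames, then run two stacked GRU layers."""
    context, weights = aggregate_frames(frames, prev_word, *att)
    h1 = gru_step(context, h1, gru1)   # layer 1: would carry the semantic loss in training
    h2 = gru_step(h1, h2, gru2)        # layer 2: feeds the word classifier
    word_id = int(np.argmax(h2 @ W_out))
    return word_id, h1, h2, weights

# Toy dimensions, chosen arbitrarily for the demo.
T, d_f, d_w, d_a, d_h, vocab = 5, 8, 4, 6, 7, 10
rng = np.random.default_rng(0)
frames = rng.normal(size=(T, d_f))
prev_word = rng.normal(size=d_w)
att = (rng.normal(size=(d_f, d_a)), rng.normal(size=(d_w, d_a)), rng.normal(size=d_a))
gru1, gru2 = init_gru(rng, d_f, d_h), init_gru(rng, d_h, d_h)
W_out = rng.normal(size=(d_h, vocab))

word_id, h1, h2, weights = decode_step(
    frames, prev_word, np.zeros(d_h), np.zeros(d_h), att, gru1, gru2, W_out
)
```

Because the aggregation is recomputed at every step from the previous word, each generated word gets its own video representation, which is the property the abstract emphasizes. The weights here are random, so the predicted `word_id` carries no meaning.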
