Venue: International Conference on Pattern Recognition
Context Visual Information-based Deliberation Network for Video Captioning

Abstract

Video captioning aims to automatically generate an accurate textual description of a video. Typical methods follow the encoder-decoder architecture and use hidden states directly to predict words. However, these methods do not correct inaccurate hidden states before feeding them into word prediction, which leads to a cascade of errors as the caption is generated word by word. In this paper, a context visual information-based deliberation network, abbreviated as CVI-DeINet, is proposed. Its key idea is to introduce a deliberator into the encoder-decoder framework. The encoder-decoder first generates a raw hidden state sequence. Unlike in existing methods, the raw hidden states are no longer used directly for word prediction but are fed into the deliberator to generate refined hidden states. Words are then predicted from the refined hidden states and the contextual visual features. Results on two datasets show that the proposed method significantly outperforms state-of-the-art methods.
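The data flow described in the abstract (raw hidden states, then a deliberator, then word prediction from refined states plus contextual visual features) can be illustrated with a minimal sketch. The abstract does not specify the deliberator's internals, so everything concrete here is an assumption made for illustration: the refinement rule `h'_t = tanh(W_h h_t + W_c c_t)`, the concatenation `[h'_t ; c_t]` before the output layer, the random weights, the dimensions, and all function names. It is not the authors' implementation.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def deliberate(raw_states, context_feats, W_h, W_c):
    """Hypothetical deliberator: h'_t = tanh(W_h h_t + W_c c_t).
    This rule is an assumption; the abstract only says raw states are refined."""
    refined = []
    for h, c in zip(raw_states, context_feats):
        a = matvec(W_h, h)
        b = matvec(W_c, c)
        refined.append([math.tanh(x + y) for x, y in zip(a, b)])
    return refined

def predict_words(refined_states, context_feats, W_out):
    """Pick one word index per step from the concatenation [h'_t ; c_t]
    (the concatenation is also an assumption)."""
    words = []
    for h, c in zip(refined_states, context_feats):
        probs = softmax(matvec(W_out, h + c))  # h + c concatenates the two lists
        words.append(max(range(len(probs)), key=probs.__getitem__))
    return words

rng = random.Random(0)
T, H, C, V = 5, 8, 6, 10  # steps, hidden size, context size, vocabulary size

# Random stand-ins for the encoder-decoder output and the visual context.
raw = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(T)]
ctx = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(T)]

W_h, W_c = rand_matrix(H, H, rng), rand_matrix(H, C, rng)
W_out = rand_matrix(V, H + C, rng)

refined = deliberate(raw, ctx, W_h, W_c)    # raw states are never used for prediction
words = predict_words(refined, ctx, W_out)  # one vocabulary index per time step
```

The point of the sketch is the ordering: word prediction only ever sees the refined states, which is the difference from a plain encoder-decoder that the abstract emphasizes.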
