Context Visual Information-based Deliberation Network for Video Captioning


Abstract

Video captioning aims to automatically generate an accurate textual description of a video. Typical methods follow the encoder-decoder architecture and use the hidden states directly to predict words. However, these methods do not correct inaccurate hidden states before feeding them into word prediction, which leads to a cascade of errors as the caption is generated word by word. In this paper, a context visual information-based deliberation network, abbreviated as CVI-DeINet, is proposed. Its key idea is to introduce a deliberator into the encoder-decoder framework. The encoder-decoder first generates a raw hidden state sequence. Unlike in existing methods, the raw hidden states are no longer used directly for word prediction; instead, they are fed into the deliberator to generate refined hidden states. The words are then predicted from the refined hidden states and the contextual visual features. Results on two datasets show that the proposed method significantly outperforms state-of-the-art methods.
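
The data flow described in the abstract (raw hidden state, then deliberator, then word prediction from the refined state plus contextual visual features) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the deliberator or the contextual visual features are computed, so the dot-product attention in `attend`, the fixed blending weight `mix` in `deliberate`, and all function names and toy vectors below are assumptions chosen only to make the pipeline concrete.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, frame_features):
    """Assumed context step: dot-product attention over frame features."""
    weights = softmax([dot(query, f) for f in frame_features])
    dim = len(frame_features[0])
    return [sum(w * f[i] for w, f in zip(weights, frame_features))
            for i in range(dim)]

def deliberate(raw_state, context, mix=0.5):
    """Hypothetical deliberator: blend the raw state with its visual context."""
    return [(1 - mix) * h + mix * c for h, c in zip(raw_state, context)]

def predict_word(refined_state, context, word_embeddings):
    """Pick the word whose embedding best matches refined state + context."""
    query = [r + c for r, c in zip(refined_state, context)]
    return max(word_embeddings, key=lambda w: dot(query, word_embeddings[w]))

def caption(raw_states, frame_features, word_embeddings):
    words = []
    for raw_state in raw_states:
        context = attend(raw_state, frame_features)   # contextual visual feature
        refined = deliberate(raw_state, context)      # refined hidden state
        words.append(predict_word(refined, context, word_embeddings))
    return words

# Toy data (made up for illustration): two frames, two words, two decoding steps.
frames = [[1.0, 0.0], [0.0, 1.0]]
vocab = {"dog": [1.0, 0.0], "runs": [0.0, 1.0]}
raw = [[5.0, 0.0], [0.0, 5.0]]
print(caption(raw, frames, vocab))  # prints ['dog', 'runs']
```

In a real model the raw hidden states would come from a trained recurrent or attention-based decoder and the refinement would be learned; the point of the sketch is only that word prediction consumes the refined state rather than the raw one.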


Bibliographic record

Source
International Conference on Pattern Recognition | 2021 | pp. 9812-9818 | 7 pages
Authors
Min Lu; Xueyong Li; Caihua Liu
Original format: PDF
Keywords
Visualization; Semantics; Coherence; Benchmark testing; Pattern recognition; Decoding;


Similar literature


1. The Visual Experience of Accessing Captioned Television and Digital Videos [J]. Butler Janine. Television & new media. 2020, No. 7
2. Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism [J]. Guo Dashan, Li Wei, Fang Xiangzhong. Neural processing letters. 2017, No. 1
3. Modeling and Performance Evaluation of a Context Information-Based Optimized Handover Scheme in 5G Networks [J]. Dong Yeong Seo, Yun Won Chung. Entropy. 2017, No. 7
4. Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning [C]. Bang Yang, Yuexian Zou. International Conference on Pattern Recognition. 2021
5. Automatic Video Captioning using Deep Neural Network [D]. Nguyen, Thang Huy. 2017
6. PainNetworks: A web-based resource for the visualisation of pain-related genes in the context of their network associations [O]. James R. Perkins, Jonathan Lees, Ana Antunes-Martins
7. Context-Aware Visual Policy Network for Sequence-Level Image Captioning [O]. Daqing Liu, Zheng-Jun Zha, Hanwang Zhang. 2018
