Journal: IEEE Transactions on Circuits and Systems for Video Technology

Long-Term Video Question Answering via Multimodal Hierarchical Memory Attentive Networks

Abstract

Long-term Video Question Answering plays an essential role in visual information retrieval. It aims to generate natural language answers to arbitrary free-form questions about a referenced long-term video. Rather than remembering the video as a sequence of visual content, humans have an innate cognitive ability to identify the critical moments related to the question at first glance, and then tie together the specific evidence around these critical moments for further analysis and reasoning. Motivated by this intuition, we propose the multimodal hierarchical memory attentive networks with two heterogeneous memory subnetworks: the top guided memory network and the bottom enhanced multimodal memory attentive network. The top guided memory network serves as a shallow inference engine that picks out the moments that are relevant and informative for the question and obtains salient video content at a coarse-grained level. Subsequently, the bottom enhanced multimodal memory attentive network is designed as an in-depth reasoning engine that performs more accurate attention, using cues from the bottom-level video evidence at a fine-grained level, to enhance question answering quality. We evaluate the proposed method on three publicly available video question answering benchmarks, namely ActivityNet-QA, MSRVTT-QA, and MSVD-QA. Experimental results demonstrate that the proposed approach significantly outperforms other state-of-the-art methods for long-term videos. Extensive ablation studies are carried out to explore the reasons behind the proposed model's effectiveness.
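
The abstract describes a two-stage, coarse-to-fine design but does not give the internals of either subnetwork. The following is a minimal sketch of generic coarse-to-fine attention under stated assumptions, not the authors' model: it assumes the question and each video clip or frame are already encoded as fixed-length vectors, it uses scaled dot-product scoring, and it keeps a hard top-k of clips before attending over their frames. All names (`coarse_to_fine_attention`, `clip_feats`, `frame_feats`, `top_k`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def coarse_to_fine_attention(question, clip_feats, frame_feats, top_k=2):
    """Illustrative two-stage attention (an assumption, not the paper's method).

    question:    list[float], question embedding of size d
    clip_feats:  list of d-dim vectors, one per clip (coarse level)
    frame_feats: frame_feats[i] is a list of d-dim vectors for clip i (fine level)
    """
    scale = math.sqrt(len(question))

    # Coarse stage: score every clip against the question, keep the top_k clips.
    clip_weights = softmax([dot(c, question) / scale for c in clip_feats])
    ranked = sorted(range(len(clip_feats)),
                    key=lambda i: clip_weights[i], reverse=True)
    top_clips = sorted(ranked[:top_k])  # restore temporal order

    # Fine stage: attend only over frames that belong to the selected clips.
    candidates = [f for i in top_clips for f in frame_feats[i]]
    frame_weights = softmax([dot(f, question) / scale for f in candidates])

    # Context vector: attention-weighted sum of the candidate frame features.
    context = [sum(w * f[j] for w, f in zip(frame_weights, candidates))
               for j in range(len(question))]
    return top_clips, frame_weights, context


# Toy usage with 2-dimensional features and four clips.
question = [1.0, 0.0]
clip_feats = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.0], [3.0, 0.0]]
frame_feats = [
    [[0.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0]],
    [[1.0, 0.0]],
]
top_clips, frame_weights, context = coarse_to_fine_attention(
    question, clip_feats, frame_feats, top_k=2)
```

The hard top-k selection is one simple way to realise "pick the relevant moments first"; a learned model would more likely use soft, differentiable weighting and memory modules at both levels.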