Long-Term Video Question Answering via Multimodal Hierarchical Memory Attentive Networks

Yu Ting; Yu Jun; Yu Zhou; Huang Qingming; Tian Qi

Abstract

Long-term video question answering plays an essential role in visual information retrieval: it aims to generate natural language answers to arbitrary free-form questions about a referenced long-term video. Rather than remembering the video as a sequence of visual content, humans have an innate cognitive ability to identify the critical moments related to the question at first glance, and then tie together the specific evidence around these critical moments for further analysis and reasoning. Motivated by this intuition, we propose multimodal hierarchical memory attentive networks with two heterogeneous memory subnetworks: the top guided memory network and the bottom enhanced multimodal memory attentive network. The top guided memory network serves as a shallow inference engine that picks out the moments relevant and informative to the question and obtains salient video content at a coarse-grained level. Subsequently, the bottom enhanced multimodal memory attentive network serves as an in-depth reasoning engine that performs more accurate attention, using cues from bottom-level video evidence at a fine-grained level, to enhance question answering quality. We evaluate the proposed method on three publicly available video question answering benchmarks, namely ActivityNet-QA, MSRVTT-QA, and MSVD-QA. Experimental results demonstrate that the proposed approach significantly outperforms other state-of-the-art methods for long-term videos. Extensive ablation studies are carried out to explore the reasons behind the proposed model's effectiveness.
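
The coarse-to-fine idea in the abstract can be illustrated with a small sketch. This is not the authors' model: it is a minimal, assumption-laden toy in plain Python in which a question vector first scores mean-pooled video segments (the coarse stage) and then attends over the individual frames of only the best-scoring segments (the fine stage). The function name `coarse_to_fine_attention`, the `top_k` parameter, dot-product scoring, and mean pooling are all hypothetical choices made for illustration; the paper's memory networks are not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def coarse_to_fine_attention(question, segments, top_k=2):
    """Toy two-stage attention (illustrative assumption, not the paper's method).

    question: feature vector for the question.
    segments: list of segments, each a list of frame feature vectors.
    Returns (indices of selected segments, attended feature vector).
    """
    # Coarse stage: score each segment by its mean-pooled summary.
    summaries = [mean_vector(seg) for seg in segments]
    coarse = softmax([dot(question, s) for s in summaries])
    ranked = sorted(range(len(segments)), key=lambda i: coarse[i], reverse=True)
    selected = sorted(ranked[:top_k])  # keep temporal order

    # Fine stage: attend over frames from the selected segments only.
    frames = [frame for i in selected for frame in segments[i]]
    fine = softmax([dot(question, frame) for frame in frames])
    attended = [
        sum(w * frame[d] for w, frame in zip(fine, frames))
        for d in range(len(question))
    ]
    return selected, attended


if __name__ == "__main__":
    q = [1.0, 0.0]
    video = [
        [[0.0, 1.0], [0.0, 1.0]],  # segment 0: unrelated to the question
        [[5.0, 0.0], [4.0, 0.0]],  # segment 1: strongly related
        [[0.1, 0.0]],              # segment 2: weakly related
    ]
    print(coarse_to_fine_attention(q, video, top_k=1))
```

In this toy setup the coarse stage discards most of the video cheaply, so the fine stage only spends attention on frames near the moments that matter, which mirrors the motivation stated in the abstract.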

Bibliographic record

Source
IEEE Transactions on Circuits and Systems for Video Technology | 2021, No. 3 | pp. 931-944 | 14 pages
Authors
Yu Ting; Yu Jun; Yu Zhou; Huang Qingming; Tian Qi
Author affiliations

Hangzhou Dianzi Univ Key Lab Complex Syst Modeling & Simulat Sch Comp Sci & Technol Hangzhou 310018 Peoples R China|Zhejiang Univ Finance & Econ Sch Informat Dongfang Coll Haining 314408 Peoples R China;

Hangzhou Dianzi Univ Key Lab Complex Syst Modeling & Simulat Sch Comp Sci & Technol Hangzhou 310018 Peoples R China;

Hangzhou Dianzi Univ Key Lab Complex Syst Modeling & Simulat Sch Comp Sci & Technol Hangzhou 310018 Peoples R China;

Univ Chinese Acad Sci Sch Comp & Control Engn Beijing 101408 Peoples R China;

Noahs Ark Lab Huawei 518129 Peoples R China;

Original format: PDF
Language: English
Keywords
Knowledge discovery; Cognition; Visualization; Task analysis; Semantics; Engines; Computational modeling; Long-term; video question answering; multimodal; hierarchical; memory network; shallow inference; coarse-grained; fine-grained; in-depth reasoning;

Date added to database: 2022-08-18 23:30:49

Similar literature

1. Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks [J]. Zhou Zhao, Zhu Zhang, Shuwen Xiao. IEEE Transactions on Image Processing, 2019, No. 12.
2. Memory Augmented Deep Recurrent Neural Network for Video Question Answering [J]. Yin Chengxiang, Tang Jian, Xu Zhiyuan. IEEE Transactions on Neural Networks and Learning Systems, 2020, No. 9.
3. Enhanced question understanding with dynamic memory networks for textual question answering [J]. Yue Chunyi, Cao Hanqiang, Xiong Kun. Expert Systems with Application, 2017, No. SEPa.
4. Bidirectional Attentive Memory Networks for Question Answering over Knowledge Bases [C]. Yu Chen, Lingfei Wu, Mohammed J. Zaki. Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
5. Inferring answer quality, answerer expertise, and ranking in question answer social networks [D]. Cai, Yuanzhe. 2014.
6. Efficacy-specific herbal group detection from traditional Chinese medicine prescriptions via hierarchical attentive neural network model [O]. Li Chen, Xinglong Liu, Siyuan Zhang. 2021.
7. Bidirectional Attentive Memory Networks for Question Answering over Knowledge Bases [O]. Yu Chen, Lingfei Wu, Mohammed J. Zaki. 2019.
8. First Steps Toward Linking Dialogues: Mediating Between Free-text Questions and Pre-recorded Video Answers [R]. Gandhe, S., Gordon, A., Leuski, A. 2004.
