IEEE Transactions on Image Processing

Compositional Attention Networks With Two-Stream Fusion for Video Question Answering

Abstract

Given a video, Video Question Answering (VideoQA) aims at answering arbitrary free-form questions about the video content in natural language. A successful VideoQA framework usually has two key components: 1) a discriminative video encoder that learns an effective video representation, preserving as much information about the video as possible, and 2) a question-guided decoder that learns to select the most relevant features for spatiotemporal reasoning and to output the correct answer. We propose compositional attention networks (CAN) with two-stream fusion for VideoQA tasks. For the encoder, we sample video snippets using a two-stream mechanism (i.e., a uniform sampling stream and an action pooling stream) and extract a sequence of visual features from each stream to represent the video semantics. For the decoder, we propose a compositional attention module that integrates the two-stream features with the attention mechanism. The compositional attention module is the core of CAN and can be seen as a modular combination of a unified attention block. With different fusion strategies, we devise five compositional attention module variants. We evaluate our approach on one long-term VideoQA dataset, ActivityNet-QA, and two short-term VideoQA datasets, MSRVTT-QA and MSVD-QA. Our CAN model achieves new state-of-the-art results on all the datasets.
