Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

Abstract

In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG). Given an untrimmed video and a declarative or interrogative sentence depicting an object, STVG aims to localize the spatio-temporal tube of the queried object. STVG has two challenging settings: (1) we need to localize spatio-temporal object tubes from untrimmed videos, where the object may exist in only a very small segment of the video; (2) we deal with multi-form sentences, including declarative sentences with explicit objects and interrogative sentences with unknown objects. Existing methods cannot tackle the STVG task because of ineffective tube pre-generation and the lack of object relationship modeling. We therefore propose a novel Spatio-Temporal Graph Reasoning Network (STGRN) for this task. First, we build a spatio-temporal region graph to capture region relationships with temporal object dynamics, which involves implicit and explicit spatial subgraphs in each frame and a temporal dynamic subgraph across frames. We then incorporate textual clues into the graph and develop multi-step cross-modal graph reasoning. Next, we introduce a spatio-temporal localizer with a dynamic selection method to directly retrieve the spatio-temporal tubes without tube pre-generation. Moreover, we contribute a large-scale video grounding dataset, VidSTG, based on the video relation dataset VidOR. Extensive experiments demonstrate the effectiveness of our method.
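
To make the notion of a spatio-temporal tube concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a tube can be stored as one bounding box per frame; the names `Tube`, `box_iou`, and `tube_overlap`, the `(x1, y1, x2, y2)` box convention, and the overlap measure itself are hypothetical choices made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """Hypothetical tube container: one bounding box per frame index."""
    boxes: Dict[int, Box]

    def coverage(self, num_frames: int) -> float:
        """Fraction of an untrimmed video's frames that the tube occupies."""
        return len(self.boxes) / num_frames


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_overlap(pred: Tube, gt: Tube) -> float:
    """Sum of per-frame box IoU over shared frames, divided by the
    number of frames covered by either tube (an assumed measure)."""
    shared = pred.boxes.keys() & gt.boxes.keys()
    either = pred.boxes.keys() | gt.boxes.keys()
    if not either:
        return 0.0
    return sum(box_iou(pred.boxes[f], gt.boxes[f]) for f in shared) / len(either)


# Example: the object appears in only 10 of 100 frames of an untrimmed video,
# and the prediction is shifted by 5 frames relative to the ground truth.
gt = Tube({f: (0.0, 0.0, 10.0, 10.0) for f in range(10, 20)})
pred = Tube({f: (0.0, 0.0, 10.0, 10.0) for f in range(15, 25)})
print(gt.coverage(100), tube_overlap(pred, gt))
```

The example shows why temporal localization matters in the untrimmed setting: even with perfect boxes on every shared frame, a temporally shifted prediction scores low under this measure because frames covered by only one of the two tubes contribute nothing.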

Bibliographic Record

Source
IEEE/CVF Conference on Computer Vision and Pattern Recognition | 2020 | pp. 10665-10674 | 10 pages
Authors
Zhu Zhang; Zhou Zhao; Yang Zhao; Qi Wang; Huasheng Liu; Lianli Gao
Original format
PDF
Keywords
Electron tubes; Grounding; Task analysis; Visualization; Cognition; Feature extraction; Natural languages

Similar Literature

1. A stereoscopic video conversion scheme based on spatio-temporal analysis of MPEG videos [J]. Guo-Shiang Lin, Hsiang-Yun Huang, Wei-Chih Chen, EURASIP journal on advances in signal processing. 2012, No. 1
2. Strong systematicity through sensorimotor conceptual grounding: an unsupervised, developmental approach to connectionist sentence processing [J]. Peter A. Jansen, Scott Watter. Connection Science. 2012, No. 1
3. VideoRay Releases Video and Still Images of Ship Grounding [J]. Sea Technology Group. Sea Technology: Worldwide Information Leader for Marine Business, Science & Engineering. 2006, No. 5
4. Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos [C]. Yitian Yuan, Lin Ma, Jingwen Wang, Conference on Neural Information Processing Systems. 2020
5. Does UDL Exist in the Wild: Initial Study Based on Observations of Videos of Instruction [D]. Hunt, Cassandra L. 2020
6. Spatio-Temporal Action Detection in Untrimmed Videos by Using Multimodal Features and Region Proposals [O]. Yeongtaek Song, Incheol Kim. 2019
7. Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video [O]. Zhenfang Chen, Lin Ma, Wenhan Luo, 2019
