Conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences



Abstract

In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG). Given an untrimmed video and a declarative or interrogative sentence depicting an object, STVG aims to localize the spatio-temporal tube of the queried object. STVG has two challenging settings: (1) we need to localize spatio-temporal object tubes from untrimmed videos, where the object may exist only in a very small segment of the video; (2) we deal with multi-form sentences, including declarative sentences with explicit objects and interrogative sentences with unknown objects. Existing methods cannot tackle the STVG task because of ineffective tube pre-generation and a lack of object relationship modeling. We therefore propose a novel Spatio-Temporal Graph Reasoning Network (STGRN) for this task. First, we build a spatio-temporal region graph to capture region relationships with temporal object dynamics; the graph comprises implicit and explicit spatial subgraphs in each frame and a temporal dynamic subgraph across frames. We then incorporate textual clues into the graph and develop multi-step cross-modal graph reasoning. Next, we introduce a spatio-temporal localizer with a dynamic selection method to directly retrieve the spatio-temporal tubes without tube pre-generation. Moreover, we contribute a large-scale video grounding dataset, VidSTG, built on the video relation dataset VidOR. Extensive experiments demonstrate the effectiveness of our method.
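To make the notion of a spatio-temporal tube concrete, the sketch below represents a tube as a temporal segment plus one bounding box per frame, and assembles it by picking one region per frame inside a given segment. This is only an illustration under stated assumptions, not the STGRN localizer: the names `Tube`, `box_iou`, `select_tube`, and `link_weight` are hypothetical, the per-region scores and the segment boundaries are assumed to come from some upstream model, and the greedy rule (region score plus overlap with the previously chosen box) is a simple stand-in for the dynamic selection method mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Tube:
    """Hypothetical container: a temporal segment with one box per frame."""
    start: int             # first frame index (inclusive)
    end: int               # last frame index (inclusive)
    boxes: Dict[int, Box]  # frame index -> selected box


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_tube(
    regions: List[List[Box]],
    scores: List[List[float]],
    start: int,
    end: int,
    link_weight: float = 0.5,  # assumed trade-off, not from the paper
) -> Tube:
    """Greedily pick one region per frame in [start, end].

    Each candidate is ranked by its own score plus link_weight times its
    IoU with the box chosen in the previous frame, so the tube stays
    spatially coherent without pre-generating candidate tubes.
    """
    boxes: Dict[int, Box] = {}
    prev: Optional[Box] = None
    for t in range(start, end + 1):
        def rank(i: int) -> float:
            link = box_iou(prev, regions[t][i]) if prev is not None else 0.0
            return scores[t][i] + link_weight * link

        best = max(range(len(regions[t])), key=rank)
        prev = regions[t][best]
        boxes[t] = prev
    return Tube(start=start, end=end, boxes=boxes)
```

For example, calling `select_tube(regions, scores, start=0, end=1)` on a three-frame video returns a tube covering frames 0 and 1 only, which mirrors the setting where the queried object exists in a small segment of an untrimmed video.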
