Computer Vision and Image Understanding

Learning and parsing video events with goal and intent prediction


Abstract

In this paper, we present a framework for parsing video events with a stochastic Temporal And-Or Graph (T-AOG) and for unsupervised learning of the T-AOG from video. The T-AOG represents a stochastic event grammar. Its alphabet consists of a set of grounded spatial relations, including the poses of agents and their interactions with objects in the scene. The terminal nodes of the T-AOG are atomic actions, each specified by a number of grounded relations over image frames. An And-node represents a sequence of actions; an Or-node represents a number of alternative ways of composing such sequences. The And-Or nodes in the T-AOG can generate a set of valid temporal configurations of atomic actions, which can be equivalently represented as the language of a stochastic context-free grammar (SCFG). For each And-node we model the temporal relations of its children nodes to distinguish events with similar structures but different temporal patterns and to interpolate missing portions of events. This makes the T-AOG grammar context-sensitive. We propose an unsupervised learning algorithm that learns the atomic actions, the temporal relations, and the And-Or nodes under the information projection principle in a coherent probabilistic framework. We also propose an event parsing algorithm based on the T-AOG that can understand events, infer the goals of agents, and predict their plausible intended actions. In comparison with existing methods, our paper makes the following contributions. (i) We represent events by a T-AOG with hierarchical compositions of events and temporal relations between the sub-events. (ii) We learn the grammar, including atomic actions and temporal relations, automatically from video data without manual supervision. (iii) Our algorithm infers the goals of agents and predicts their intents by a top-down process, handles event insertion and multi-agent events, keeps all possible interpretations of the video to preserve ambiguities, and achieves the globally optimal parsing solution in a Bayesian framework. (iv) The algorithm uses event context to improve the detection of atomic actions and to segment and recognize objects in the scene. Extensive experiments, covering indoor and outdoor scenes and single-agent and multi-agent events, are conducted to validate the effectiveness of the proposed approach.
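Since the abstract compresses the grammar machinery into a few sentences, a minimal sketch may help make the structure concrete. The Python toy below is purely illustrative and is not the paper's implementation: the "get_drink" event, all node names, and all branching probabilities are hypothetical. It shows how And-nodes (sequential composition) and Or-nodes (stochastic alternatives) over atomic-action terminals generate valid temporal configurations, i.e. strings in the language of an equivalent SCFG.

import random
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Terminal:
    # Leaf of the T-AOG: an atomic action, which the paper grounds in
    # spatial relations (agent pose, agent-object interaction) over frames.
    name: str

    def sample(self) -> List[str]:
        return [self.name]


@dataclass
class AndNode:
    # Sequential composition: all children occur, in temporal order.
    name: str
    children: List["Node"]

    def sample(self) -> List[str]:
        seq: List[str] = []
        for child in self.children:
            seq.extend(child.sample())
        return seq


@dataclass
class OrNode:
    # Alternative decompositions with branching probabilities; this is the
    # stochastic choice that gives the grammar its SCFG-like language.
    name: str
    branches: List[Tuple[float, "Node"]]

    def sample(self) -> List[str]:
        r, cum = random.random(), 0.0
        for prob, child in self.branches:
            cum += prob
            if r <= cum:
                return child.sample()
        return self.branches[-1][1].sample()  # guard against rounding error


Node = Union[Terminal, AndNode, OrNode]

# Hypothetical toy event: fetch a cup, obtain water one of two ways, drink.
get_water = OrNode("get_water", [
    (0.7, Terminal("pour_from_kettle")),
    (0.3, Terminal("use_dispenser")),
])
get_drink = AndNode("get_drink",
                    [Terminal("fetch_cup"), get_water, Terminal("drink")])

if __name__ == "__main__":
    random.seed(0)
    for _ in range(3):
        # Each sample is one valid temporal configuration of atomic actions.
        print(" -> ".join(get_drink.sample()))

In the paper's full model, each And-node additionally carries temporal relations (such as duration models) over its children, which is what makes the grammar context-sensitive rather than purely context-free; the event parser effectively runs this generation in reverse, scoring detected atomic actions against the grammar to infer an agent's goal and predict the plausible next action.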
