Computer Vision and Image Understanding

Learning and parsing video events with goal and intent prediction


Abstract

In this paper, we present a framework for parsing video events with a stochastic Temporal And-Or Graph (T-AOG) and for unsupervised learning of the T-AOG from video. The T-AOG represents a stochastic event grammar. Its alphabet consists of a set of grounded spatial relations, including the poses of agents and their interactions with objects in the scene. The terminal nodes of the T-AOG are atomic actions, each specified by a number of grounded relations over image frames. An And-node represents a sequence of actions; an Or-node represents a number of alternative ways of composing such sequences. The And-Or nodes in the T-AOG can generate a set of valid temporal configurations of atomic actions, which can be equivalently represented as the language of a stochastic context-free grammar (SCFG). For each And-node we model the temporal relations of its children nodes to distinguish events with similar structures but different temporal patterns and to interpolate missing portions of events. This makes the T-AOG grammar context-sensitive. We propose an unsupervised learning algorithm that learns the atomic actions, the temporal relations, and the And-Or nodes under the information projection principle in a coherent probabilistic framework. We also propose an event parsing algorithm based on the T-AOG that can understand events, infer the goals of agents, and predict their plausible intended actions. In comparison with existing methods, our paper makes the following contributions. (i) We represent events by a T-AOG with hierarchical compositions of events and temporal relations between the sub-events. (ii) We learn the grammar, including atomic actions and temporal relations, automatically from video data without manual supervision. (iii) Our algorithm infers the goals of agents and predicts their intents by a top-down process, handles event insertion and multi-agent events, keeps all possible interpretations of the video to preserve ambiguities, and achieves the globally optimal parsing solution in a Bayesian framework. (iv) The algorithm uses event context to improve the detection of atomic actions and to segment and recognize objects in the scene. Extensive experiments, covering indoor and outdoor scenes and single-agent and multi-agent events, are conducted to validate the effectiveness of the proposed approach.
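Since the abstract compresses the grammar machinery into a few sentences, a minimal sketch may help make the structure concrete. The Python toy below is purely illustrative and is not the paper's implementation: the "get_drink" event, all node names, and all branching probabilities are hypothetical. It shows how And-nodes (sequential composition) and Or-nodes (stochastic alternatives) over atomic-action terminals generate valid temporal configurations, i.e. strings in the language of an equivalent SCFG.

import random
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Terminal:
    # Leaf of the T-AOG: an atomic action, which the paper grounds in
    # spatial relations (agent pose, agent-object interaction) over frames.
    name: str

    def sample(self) -> List[str]:
        return [self.name]


@dataclass
class AndNode:
    # Sequential composition: all children occur, in temporal order.
    name: str
    children: List["Node"]

    def sample(self) -> List[str]:
        seq: List[str] = []
        for child in self.children:
            seq.extend(child.sample())
        return seq


@dataclass
class OrNode:
    # Alternative decompositions with branching probabilities; this is the
    # stochastic choice that gives the grammar its SCFG-like language.
    name: str
    branches: List[Tuple[float, "Node"]]

    def sample(self) -> List[str]:
        r, cum = random.random(), 0.0
        for prob, child in self.branches:
            cum += prob
            if r <= cum:
                return child.sample()
        return self.branches[-1][1].sample()  # guard against rounding error


Node = Union[Terminal, AndNode, OrNode]

# Hypothetical toy event: fetch a cup, obtain water one of two ways, drink.
get_water = OrNode("get_water", [
    (0.7, Terminal("pour_from_kettle")),
    (0.3, Terminal("use_dispenser")),
])
get_drink = AndNode("get_drink",
                    [Terminal("fetch_cup"), get_water, Terminal("drink")])

if __name__ == "__main__":
    random.seed(0)
    for _ in range(3):
        # Each sample is one valid temporal configuration of atomic actions.
        print(" -> ".join(get_drink.sample()))

In the paper's full model, each And-node additionally carries temporal relations (such as duration models) over its children, which is what makes the grammar context-sensitive rather than purely context-free; the event parser effectively runs this generation in reverse, scoring detected atomic actions against the grammar to infer an agent's goal and predict the plausible next action.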
