
Towards Action Recognition and Localization in Videos with Weakly Supervised Learning


Abstract

Human behavior understanding is a fundamental problem of computer vision. It is an important component of numerous real-life applications, such as human-computer interaction, sports analysis, video search, and many others. In this thesis we work on the problem of action recognition and localization, which is a crucial part of human behavior understanding. Action recognition explains what a human is doing in a video, while action localization indicates where and when in the video the action is happening. We focus on two important aspects of the problem: (1) capturing intra-class variation of action categories and (2) inference of action location. Manually annotating videos with fine-grained action labels and spatio-temporal action locations is a nontrivial task, so weakly supervised learning approaches are of particular interest.

Real-life actions are complex, and the same action can look different in different scenarios. A single template cannot capture such data variability. Therefore, for each action category we automatically discover small clusters of examples that are visually similar to each other. A separate classifier is learnt for each cluster, so that more class variability is captured. In addition, we establish a direct association between a novel test example and examples from the training data, and demonstrate how metadata (e.g., attributes) can be transferred to test examples.

Weakly supervised learning for action recognition and localization is another challenging task, since it requires automatically inferring the action location in every training video during learning. Initially, we simplify this problem and try to find discriminative regions in videos that lead to better recognition performance. The regions are inferred such that they are visually similar across all videos of the same category. Ideally, the regions should correspond to the action location; however, there is a gap between the inferred discriminative regions and semantically meaningful regions representing the action location. To close this gap, we incorporate human eye gaze data to drive the inference of regions during learning, which allows us to infer regions that are both discriminative and semantically meaningful. Furthermore, we use the inferred regions and the learnt action model to assist top-down eye gaze prediction.
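To make the first contribution concrete, below is a minimal sketch of the sub-category idea: the positive examples of one action class are clustered into visually coherent groups, one classifier is trained per cluster, and a test example is scored by the maximum over the cluster classifiers. The use of k-means, linear SVMs, and precomputed fixed-length video features are illustrative assumptions, not the thesis's actual pipeline.

```python
# A minimal sketch of per-cluster action classifiers, assuming precomputed
# fixed-length feature vectors per video; k-means and linear SVMs are
# illustrative choices, not the thesis's exact method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def train_per_cluster_classifiers(pos_feats, neg_feats, n_clusters=3):
    """pos_feats: (n_pos, d) features of one action class;
    neg_feats: (n_neg, d) features drawn from all other classes."""
    # Discover small clusters of visually similar positive examples.
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(pos_feats)
    classifiers = []
    for c in range(n_clusters):
        members = pos_feats[clusters == c]
        X = np.vstack([members, neg_feats])
        y = np.concatenate([np.ones(len(members)), np.zeros(len(neg_feats))])
        # One classifier per cluster, so each model only has to explain
        # a single visual mode of the action.
        classifiers.append(LinearSVC(C=1.0).fit(X, y))
    return classifiers

def score(classifiers, x):
    # A test example is scored by its best-matching sub-category model.
    return max(c.decision_function(x.reshape(1, -1))[0] for c in classifiers)
```

Taking the maximum over cluster classifiers lets each cluster specialize in one appearance mode of the action, which is how a multi-template model captures intra-class variation; the best-scoring cluster also gives the direct train-test association along which metadata such as attributes could be transferred.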
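The second contribution, weakly supervised region inference, can be illustrated with a toy multiple-instance-style alternation: each training video is treated as a bag of candidate region features with only a video-level label, and learning alternates between selecting the highest-scoring region in each positive video and retraining the classifier on those selections. This MI-SVM-style loop is a stand-in sketch, not the thesis's actual learning objective.

```python
# A toy sketch of weakly supervised region inference, assuming each video
# is a bag of candidate region features with only a video-level label.
# The MI-SVM-style alternation is an illustrative stand-in for the
# thesis's objective (which additionally uses eye gaze data).
import numpy as np
from sklearn.svm import LinearSVC

def infer_discriminative_regions(pos_bags, neg_regions, n_iters=5):
    """pos_bags: list of (n_regions_i, d) arrays, one per positive video;
    neg_regions: (n_neg, d) region features from negative videos."""
    # Initialize each positive video with its mean region feature.
    selected = [bag.mean(axis=0) for bag in pos_bags]
    clf = None
    for _ in range(n_iters):
        X = np.vstack([np.vstack(selected), neg_regions])
        y = np.concatenate([np.ones(len(selected)), np.zeros(len(neg_regions))])
        clf = LinearSVC(C=1.0).fit(X, y)
        # Re-select, in each positive video, the region the current
        # classifier scores highest, pushing selections to be both
        # discriminative and consistent across videos of the category.
        selected = [bag[np.argmax(clf.decision_function(bag))] for bag in pos_bags]
    return clf, selected
```

As the abstract notes, regions selected this way are discriminative but not necessarily the semantically meaningful action location; the eye gaze data would enter at the selection step, biasing inference toward regions people actually look at.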

Bibliographic details

  • Author: Shapovalova, Nataliya
  • Year: 2014
  • Format: PDF
