Journal: Electronic Letters on Computer Vision and Image Analysis (ELCVIA)

From pixels to gestures: learning visual representations for human analysis in color and depth data sequences

Abstract

The visual analysis of humans from images is an important topic of interest due to its relevance to many computer vision applications like pedestrian detection, monitoring and surveillance, human-computer interaction, e-health or content-based image retrieval, among others. In this dissertation we are interested in learning different visual representations of the human body that are helpful for the visual analysis of humans in images and video sequences. To that end, we analyze both RGB and depth image modalities and address the problem from three different research lines, at different levels of abstraction, from pixels to gestures: human segmentation, human pose estimation and gesture recognition. First, we show how binary segmentation (object vs. background) of the human body in image sequences helps to remove all the background clutter present in the scene. The presented method, based on Graph cuts optimization, enforces the spatio-temporal consistency of the produced segmentation masks among consecutive frames. Secondly, we present a multi-label segmentation framework for obtaining much more detailed segmentation masks: instead of a binary representation separating the human body from the background, finer segmentation masks are obtained that separate the different body parts. At a higher level of abstraction, we aim for a simpler yet descriptive representation of the human body. Human pose estimation methods usually rely on skeletal models of the human body, formed by segments (or rectangles) that represent the body limbs, appropriately connected following the kinematic constraints of the human body. In practice, such skeletal models must fulfill some constraints in order to allow for efficient inference, which in turn limits the expressiveness of the model. To cope with this, we introduce a top-down approach for predicting the position of the body parts in the model, using a mid-level part representation based on Poselets. Finally, we propose a framework for gesture recognition based on the bag-of-visual-words model. We leverage the benefits of the RGB and depth image modalities by combining modality-specific visual vocabularies in a late fusion fashion. A new rotation-variant depth descriptor is presented, yielding better results than other state-of-the-art descriptors, and spatio-temporal pyramids are used to encode rough spatial and temporal structure. In addition, we present a probabilistic reformulation of Dynamic Time Warping for gesture segmentation in video sequences: a Gaussian-based probabilistic model of a gesture is learnt, implicitly encoding possible deformations in both the spatial and time domains.
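As a rough illustration of the late-fusion strategy mentioned in the abstract, the sketch below builds a separate bag-of-visual-words vocabulary for RGB and for depth descriptors, trains one classifier per modality, and combines their scores at decision level. The vocabulary sizes, the linear SVM classifiers and the fusion weight `alpha` are illustrative assumptions, not the dissertation's actual configuration.

```python
# A minimal sketch of late fusion of modality-specific bag-of-visual-words
# representations (RGB + depth), assuming local descriptors are already extracted.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC


def build_vocabulary(descriptors, k=64, seed=0):
    """Cluster local descriptors (N x D) into a k-word visual vocabulary."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(descriptors)


def encode_bovw(descriptors, vocabulary):
    """Quantize the descriptors of one sequence into a normalized word histogram."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)


def late_fusion_scores(rgb_clf, depth_clf, rgb_hist, depth_hist, alpha=0.5):
    """Combine the two modality-specific classifiers at the score level."""
    s_rgb = rgb_clf.decision_function([rgb_hist])
    s_depth = depth_clf.decision_function([depth_hist])
    return alpha * s_rgb + (1.0 - alpha) * s_depth


# Toy usage with random descriptors standing in for real RGB / depth features.
rng = np.random.default_rng(0)
rgb_vocab = build_vocabulary(rng.normal(size=(500, 32)))
depth_vocab = build_vocabulary(rng.normal(size=(500, 16)))

# One BoVW histogram per training sequence and modality, plus gesture labels.
X_rgb = np.stack([encode_bovw(rng.normal(size=(80, 32)), rgb_vocab) for _ in range(40)])
X_depth = np.stack([encode_bovw(rng.normal(size=(80, 16)), depth_vocab) for _ in range(40)])
y = np.arange(40) % 3

rgb_clf = SVC(kernel="linear").fit(X_rgb, y)
depth_clf = SVC(kernel="linear").fit(X_depth, y)

test_rgb = encode_bovw(rng.normal(size=(80, 32)), rgb_vocab)
test_depth = encode_bovw(rng.normal(size=(80, 16)), depth_vocab)
fused = late_fusion_scores(rgb_clf, depth_clf, test_rgb, test_depth)
print("predicted gesture:", rgb_clf.classes_[fused.argmax()])
```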
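Similarly, the Gaussian-based probabilistic reformulation of Dynamic Time Warping could be sketched as follows: each frame of a reference gesture is modelled with a diagonal Gaussian learnt from aligned training examples, frame costs become negative log-likelihoods, and a subsequence-DTW pass over a continuous stream flags segments whose length-normalized cost falls below a threshold. The diagonal-Gaussian model, the recursion and the threshold value are illustrative assumptions rather than the dissertation's exact formulation.

```python
# A minimal sketch of Gaussian-based probabilistic DTW for spotting a gesture in
# a continuous stream of per-frame feature vectors.
import numpy as np


def learn_gesture_model(aligned_examples):
    """Per-frame mean/variance from training sequences aligned to a common length.

    aligned_examples: array of shape (n_examples, n_frames, dim).
    """
    mu = aligned_examples.mean(axis=0)                    # (n_frames, dim)
    var = aligned_examples.var(axis=0) + 1e-6             # avoid zero variance
    return mu, var


def frame_cost(stream, mu, var):
    """Negative Gaussian log-likelihood of every stream frame under every model frame."""
    # stream: (T, dim); mu, var: (M, dim) -> cost matrix of shape (M, T)
    diff = stream[None, :, :] - mu[:, None, :]            # (M, T, dim)
    nll = 0.5 * ((diff ** 2) / var[:, None, :] + np.log(2 * np.pi * var[:, None, :])).sum(-1)
    return nll


def spot_gesture(stream, mu, var, threshold):
    """Subsequence DTW: return (cost, end_frame) of detections below the threshold."""
    cost = frame_cost(stream, mu, var)                    # (M, T)
    M, T = cost.shape
    D = np.full((M, T), np.inf)
    D[0, :] = cost[0, :]                                  # a match may start at any frame
    for i in range(1, M):
        for t in range(1, T):
            D[i, t] = cost[i, t] + min(D[i - 1, t - 1], D[i - 1, t], D[i, t - 1])
    end_costs = D[-1, :] / M                              # length-normalized path cost
    return [(c, t) for t, c in enumerate(end_costs) if c < threshold]


# Toy usage: a 20-frame gesture model spotted inside a 100-frame random stream.
rng = np.random.default_rng(1)
train = rng.normal(size=(5, 20, 8))                       # 5 aligned training examples
mu, var = learn_gesture_model(train)
stream = rng.normal(size=(100, 8))
stream[40:60] = mu + 0.1 * rng.normal(size=(20, 8))       # embed a gesture instance
detections = spot_gesture(stream, mu, var, threshold=8.0)
print("candidate end frames:", [t for _, t in detections])
```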
