Journal: Circuits and Systems for Video Technology, IEEE Transactions on

Efficient Convolutional Neural Networks for Depth-Based Multi-Person Pose Estimation

Abstract

Robust multi-person 2D body landmark localization and pose estimation is essential for understanding human behavior and interaction, for instance in human-robot interaction (HRI) settings. Accurate methods have been proposed recently, but they usually rely on rather deep Convolutional Neural Network (CNN) architectures and therefore require large computational and training resources. In this paper, we investigate different architectures and methodologies to address these issues and achieve fast and accurate multi-person 2D pose estimation. To foster speed, we propose to work with depth images, whose structure contains sufficient information about body landmarks while being simpler than textured color images, and which may therefore require less complex CNNs for processing. In this context, we make the following contributions: i) we study several CNN architecture designs that combine pose machines, which rely on the cascade-of-detectors concept, with lightweight and efficient CNN structures; ii) to address the need for large training datasets with high variability, we rely on semi-synthetic data that combines multi-person synthetic depth data with real sensor backgrounds; iii) we explore domain adaptation techniques to address the performance gap introduced by testing on real depth images; iv) to increase the accuracy of our fast, lightweight CNN models, we investigate knowledge distillation at several architecture levels, which effectively enhances performance. Experiments on synthetic and real data highlight the impact of our design choices, provide insights into methods that address standard issues faced in practical applications, and result in architectures that meet our goals in both accuracy and speed.
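
Contribution ii) can be illustrated with a minimal sketch of how a rendered multi-person depth map might be merged with a real sensor background. The abstract does not describe the actual compositing procedure, so everything here is an assumption made for illustration: the function name `composite_depth`, the convention that a depth of 0.0 means "no measurement", and the rule that a synthetic pixel wins only where it is closer to the camera than the background.

```python
from typing import List

# Depth in metres; 0.0 marks "no measurement" (an assumed convention).
DepthMap = List[List[float]]


def composite_depth(synthetic: DepthMap, background: DepthMap) -> DepthMap:
    """Overlay rendered people (synthetic) on a real background depth frame.

    Hypothetical z-buffer rule: keep the synthetic pixel when it is valid
    and either the background is invalid or the synthetic surface is closer.
    """
    if len(synthetic) != len(background):
        raise ValueError("depth maps must have the same height")
    out: DepthMap = []
    for syn_row, bg_row in zip(synthetic, background):
        if len(syn_row) != len(bg_row):
            raise ValueError("depth maps must have the same width")
        row = []
        for syn, bg in zip(syn_row, bg_row):
            person_visible = syn > 0.0 and (bg == 0.0 or syn < bg)
            row.append(syn if person_visible else bg)
        out.append(row)
    return out


# Toy 2x2 example: one person pixel in front of the background,
# one person pixel hidden behind a closer background surface.
synthetic = [[0.0, 2.0], [3.5, 0.0]]
background = [[4.0, 4.0], [3.0, 0.0]]
merged = composite_depth(synthetic, background)
print(merged)  # [[4.0, 2.0], [3.0, 0.0]]
```

In such a setup the landmark annotations would come from the synthetic renderer, while the background contributes real sensor characteristics; a practical pipeline would likely also need to handle sensor noise and occlusion boundaries, which this sketch ignores.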