
Deep Image-to-Video Adaptation and Fusion Networks for Action Recognition



Abstract

Existing deep learning methods for action recognition in videos require a large number of labeled videos for training, which is labor-intensive and time-consuming. For the same action, the knowledge learned from different media types, e.g., videos and images, may be related and complementary. However, due to the domain shift and heterogeneous feature representations between videos and images, the performance of classifiers trained on images may degrade dramatically when they are directly deployed to videos. In this paper, we propose a novel method, named Deep Image-to-Video Adaptation and Fusion Networks (DIVAFN), to enhance action recognition in videos by transferring knowledge from images, using video keyframes as a bridge. DIVAFN is a unified deep learning model that integrates domain-invariant representation learning and cross-modal feature fusion into a single optimization framework. Specifically, we design an efficient cross-modal similarity metric to reduce the modality shift among images, keyframes, and videos. We then adopt an autoencoder architecture whose hidden layer is constrained to match the semantic representations of the action class names. In this way, when the autoencoders project the learned features from the different domains into a common space, more compact, informative, and discriminative representations are obtained. Finally, the concatenation of the semantic feature representations learned by these three autoencoders is used to train the classifier for action recognition in videos. Comprehensive experiments on four real-world datasets show that our method outperforms state-of-the-art domain adaptation and action recognition methods.
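The pipeline the abstract describes (three modality-specific autoencoders with semantically constrained hidden layers, a cross-modal similarity term, and a classifier over the concatenated hidden representations) can be sketched compactly. The PyTorch code below is a minimal illustration under assumed details, not the authors' implementation: the names SemanticAutoencoder, semantic_ae_loss, cross_modal_similarity_loss, and fuse_and_classify, the feature dimensions, the 300-dimensional class-name embeddings, and the loss weight alpha are all hypothetical.

```python
# Minimal illustrative sketch of the DIVAFN idea from the abstract.
# All names, layer sizes, and loss weights here are hypothetical;
# this is not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAutoencoder(nn.Module):
    """Autoencoder whose hidden layer is pushed toward the semantic
    embedding (e.g., a word vector) of the action class name."""
    def __init__(self, feat_dim: int, sem_dim: int):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, sem_dim)
        self.decoder = nn.Linear(sem_dim, feat_dim)

    def forward(self, x):
        h = self.encoder(x)      # hidden layer = semantic representation
        x_rec = self.decoder(h)  # reconstruction of the input feature
        return h, x_rec

def semantic_ae_loss(x, h, x_rec, class_sem, alpha=1.0):
    """Reconstruction loss plus the constraint tying the hidden layer
    to the class-name semantics (alpha is an assumed trade-off weight)."""
    return F.mse_loss(x_rec, x) + alpha * F.mse_loss(h, class_sem)

def cross_modal_similarity_loss(h_a, h_b, same_class):
    """Hypothetical stand-in for the paper's cross-modal similarity
    metric: pulls same-class representations of two modalities together."""
    sim = F.cosine_similarity(h_a, h_b)          # per-sample similarity
    return ((1.0 - sim) * same_class.float()).mean()

# One autoencoder per modality: images, keyframes, videos (assumed dims).
feat_dims = {"image": 4096, "keyframe": 4096, "video": 8192}
sem_dim = 300                                    # e.g., word-vector dimension
num_classes = 101                                # assumed class count
aes = {m: SemanticAutoencoder(d, sem_dim) for m, d in feat_dims.items()}

# Classifier trained on the concatenated semantic representations.
classifier = nn.Linear(3 * sem_dim, num_classes)

def fuse_and_classify(img_x, key_x, vid_x):
    h_img, _ = aes["image"](img_x)
    h_key, _ = aes["keyframe"](key_x)
    h_vid, _ = aes["video"](vid_x)
    fused = torch.cat([h_img, h_key, h_vid], dim=1)
    return classifier(fused)
```

In this sketch, training would minimize each modality's semantic_ae_loss plus pairwise cross_modal_similarity_loss terms over the image-keyframe, keyframe-video, and image-video pairs; at test time, only fuse_and_classify is needed to label a video.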

Bibliographic record

  • Source
    《IEEE Transactions on Image Processing》 | 2020 | pp. 3168-3182 | 15 pages
  • Author affiliations

    Sun Yat-sen University, School of Data & Computer Science, Guangzhou 510006, China | Xidian University, School of Telecommunications Engineering, Xi'an 710071, China;

    Xidian University, School of Telecommunications Engineering, Xi'an 710071, China;

    Xidian University, School of Telecommunications Engineering, Xi'an 710071, China;

    Northwestern Polytechnical University, School of Computer Science, Xi'an 710072, China;

    Northwestern Polytechnical University, School of Automation, Xi'an 710072, China;

  • Indexing information
  • Original format: PDF
  • Language: English
  • Chinese Library Classification (CLC)
  • Keywords

    Action recognition; adaptation; deep learning; fusion

