IEEE/CVF Conference on Computer Vision and Pattern Recognition

Appearance-and-Relation Networks for Video Classification

Abstract

Spatiotemporal feature learning in videos is a fundamental problem in computer vision. This paper presents a new architecture, termed the Appearance-and-Relation Network (ARTNet), that learns video representations in an end-to-end manner. ARTNets are constructed by stacking multiple generic building blocks, called SMART blocks, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner. Specifically, a SMART block decouples the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling. The appearance branch is implemented as a linear combination of pixels or filter responses within each frame, while the relation branch is designed around multiplicative interactions between pixels or filter responses across multiple frames. We perform experiments on three action recognition benchmarks: Kinetics, UCF101, and HMDB51, demonstrating that SMART blocks yield an evident improvement over 3D convolutions for spatiotemporal feature learning. Under the same training setting, ARTNets achieve performance superior to existing state-of-the-art methods on these three datasets.
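The abstract's description of a SMART block suggests a two-branch layout: a purely spatial convolution for the appearance branch and a spatiotemporal convolution whose responses interact multiplicatively for the relation branch. The sketch below is a minimal PyTorch rendering of that description only; the kernel shapes, channel widths, batch-normalization and ReLU placement, the use of squared responses to realize the multiplicative interactions, and the concatenation-plus-1x1x1 fusion are illustrative assumptions rather than the authors' exact configuration, and the class name SMARTBlockSketch is hypothetical.

```python
# Minimal sketch of a SMART-style block, based only on the abstract's wording.
# All architectural details here are assumptions made for illustration.
import torch
import torch.nn as nn


class SMARTBlockSketch(nn.Module):
    """Appearance branch (per-frame spatial conv) plus relation branch
    (spatiotemporal conv whose responses are squared, so products of inputs
    from different frames appear in the output), fused by concatenation."""

    def __init__(self, in_channels: int, out_channels: int,
                 spatial_kernel: int = 3, temporal_kernel: int = 3):
        super().__init__()
        # Appearance branch: a linear combination of pixels / filter responses
        # within each frame, i.e. a 1 x k x k (purely spatial) convolution.
        self.appearance = nn.Sequential(
            nn.Conv3d(in_channels, out_channels,
                      kernel_size=(1, spatial_kernel, spatial_kernel),
                      padding=(0, spatial_kernel // 2, spatial_kernel // 2),
                      bias=False),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )
        # Relation branch: a t x k x k convolution spanning several frames;
        # squaring its responses introduces multiplicative cross-frame terms.
        self.relation_conv = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(temporal_kernel, spatial_kernel, spatial_kernel),
            padding=(temporal_kernel // 2, spatial_kernel // 2,
                     spatial_kernel // 2),
            bias=False)
        self.relation_bn = nn.BatchNorm3d(out_channels)
        # Fuse the two branches and reduce back to out_channels.
        self.reduce = nn.Sequential(
            nn.Conv3d(2 * out_channels, out_channels, kernel_size=1,
                      bias=False),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        app = self.appearance(x)
        rel = torch.relu(self.relation_bn(self.relation_conv(x) ** 2))
        return self.reduce(torch.cat([app, rel], dim=1))


if __name__ == "__main__":
    block = SMARTBlockSketch(in_channels=3, out_channels=64)
    clip = torch.randn(2, 3, 16, 112, 112)   # two 16-frame RGB clips
    print(block(clip).shape)                  # torch.Size([2, 64, 16, 112, 112])
```

Stacking several such blocks with pooling or strided convolutions between them would give an ARTNet-style backbone; depths, strides, and the exact fusion scheme should be taken from the original paper rather than this sketch.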
