Unified Spatio-Temporal Attention Networks for Action Recognition in Videos


Abstract

Recognizing actions in videos is not a trivial task, because video is an information-intensive medium that spans multiple modalities. Moreover, within each modality, an action may appear only in certain spatial regions, and only some temporal segments of the video may contain it. A natural question is how to locate the attended spatial areas and the selective video segments for action recognition. In this paper, we devise a general attention neural cell, called AttCell, that estimates an attention probability not only at each spatial location but also for each video segment in a temporal sequence. Building on AttCell, a unified Spatio-Temporal Attention Network (STAN) is proposed in the context of multiple modalities. Specifically, STAN extracts the feature map of one convolutional layer as the local descriptors on each modality and pools the extracted descriptors, weighted by the spatial attention measured by AttCell, into a representation of each segment. Then, we concatenate the representations across modalities to seek a consensus on the temporal attention, a priori, and holistically fuse the combined segment representations into a video-level representation for recognition. Our model differs from conventional deep networks with attention mechanisms in that our temporal attention provides principled, global guidance across different modalities and video segments. Extensive experiments are conducted on four public datasets (UCF101, CCV, THUMOS14, and Sports-1M), and our STAN consistently achieves superior results over several state-of-the-art techniques. More remarkably, we validate the effectiveness of our proposal when capitalizing on different numbers of modalities.
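The two-stage pooling the abstract describes (spatial attention over each segment's conv feature map, then temporal attention over the resulting segment representations) can be sketched as follows. This is a minimal illustration with dot-product attention scorers and random weights, not the authors' AttCell implementation; all dimensions and parameter names (`w_sp`, `w_tp`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: T segments, an H x W spatial grid, C channels.
T, H, W, C = 4, 7, 7, 16
rng = np.random.default_rng(0)
feat = rng.standard_normal((T, H, W, C))   # conv feature map per segment

# Spatial attention: score every location, then pool the local
# descriptors of each segment into a single C-dim representation.
w_sp = rng.standard_normal(C)                        # toy scoring weights
local = feat.reshape(T, H * W, C)                    # (T, H*W, C)
sp_attn = softmax(local @ w_sp, axis=1)              # (T, H*W), sums to 1 per segment
seg_repr = (sp_attn[..., None] * local).sum(axis=1)  # (T, C)

# Temporal attention: weight the segments, then fuse them into a
# single video-level representation used for recognition.
w_tp = rng.standard_normal(C)
tp_attn = softmax(seg_repr @ w_tp, axis=0)           # (T,), sums to 1
video_repr = (tp_attn[:, None] * seg_repr).sum(axis=0)  # (C,)
```

In the paper's multi-modal setting, `seg_repr` would be the concatenation of per-modality representations before the temporal attention is computed, so that a single set of temporal weights guides all modalities jointly.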

Bibliographic Details

  • Source
    IEEE Transactions on Multimedia | 2019, No. 2 | pp. 416-428 | 13 pages
  • Author Affiliations

    Univ Sci & Technol China, Hefei 230000, Anhui, Peoples R China|Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230000, Anhui, Peoples R China;

    Microsoft Res, Multimedia Search & Mining Grp, Beijing 100080, Peoples R China;

    Peking Univ, Natl Engn Lab Video Technol, Sch Elect Engn & Comp Sci, Beijing 100080, Peoples R China;

    JD AI Res, Beijing 100101, Peoples R China|JD AI Res, Comp Vis & Multimedia Lab, Beijing 100101, Peoples R China;

    Lenovo, Beijing 100085, Peoples R China;

  • Indexing Information
  • Original Format: PDF
  • Language: English
  • CLC Classification
  • Keywords

    Action recognition; spatio-temporal attention; deep convolutional networks;

