IEEE Transactions on Multimedia

Where-and-When to Look: Deep Siamese Attention Networks for Video-Based Person Re-Identification


Abstract

Video-based person re-identification (re-id) is a central application in surveillance systems, with significant security concerns. Matching persons across disjoint camera views in their video fragments is inherently challenging due to large visual variations and uncontrolled frame rates. Two steps are crucial to person re-id, namely, discriminative feature learning and metric learning. However, existing approaches consider the two steps independently, and they do not make full use of the temporal and spatial information in the videos. In this paper, we propose a Siamese attention architecture that jointly learns spatiotemporal video representations and their similarity metrics. The network extracts local convolutional features from regions of each frame and enhances their discriminative capability by focusing on distinct regions when measuring the similarity with another pedestrian video. The attention mechanism is embedded into spatial gated recurrent units to selectively propagate relevant features and memorize their spatial dependencies through the network. The model essentially learns which parts (where) from which frames (when) are relevant and distinctive for matching persons, and attaches higher importance to them. The proposed Siamese model is end-to-end trainable to jointly learn comparable hidden representations for paired pedestrian videos and their similarity value. Extensive experiments on three benchmark datasets show the effectiveness of each component of the proposed deep network while outperforming state-of-the-art methods.
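The core idea of the abstract — attention weights deciding which frames matter, and a Siamese branch with shared parameters producing a similarity score for a pair of videos — can be illustrated with a minimal NumPy sketch. This is not the paper's architecture (which uses convolutional features and spatial gated recurrent units); the scoring vector `w`, the feature dimensions, and the cosine-similarity matching score are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_and_pool(frames, w):
    # frames: (T, D) per-frame features; w: (D,) hypothetical scoring vector.
    # Temporal attention decides *when* to look: frames with higher scores
    # contribute more to the pooled video-level representation.
    scores = frames @ w                # (T,) relevance score per frame
    alpha = softmax(scores)            # (T,) attention weights, sum to 1
    return alpha @ frames              # (D,) attention-pooled video feature

def siamese_similarity(seq_a, seq_b, w):
    # Both branches share the same parameters (w) — the Siamese property —
    # so the two videos are embedded into a comparable representation space.
    fa = attend_and_pool(seq_a, w)
    fb = attend_and_pool(seq_b, w)
    # Cosine similarity as an illustrative matching score.
    return float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-8))

rng = np.random.default_rng(0)
w = rng.normal(size=64)                # stand-in for learned attention params
video_a = rng.normal(size=(10, 64))    # 10 frames, 64-dim features each
video_b = rng.normal(size=(12, 64))    # sequences may differ in length
sim = siamese_similarity(video_a, video_b, w)
print(f"similarity: {sim:.3f}")
```

In the full model, the attention additionally operates over spatial regions within each frame (the "where"), and the whole pipeline is trained end-to-end so that matched pairs score higher than mismatched ones.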
