Visual identification of an individual in a crowded environment observed by a distributed camera network is critical to a variety of tasks, including commercial space management, border control, and crime prevention. Automatic re-identification of a person from public-space CCTV video is challenging due to spatiotemporal variations in visual features and strong visual similarity between people's appearances, compounded by low-resolution, poor-quality video data. Relying solely on re-identification from a probe image is limiting, as a linguistic description of an individual's profile may often be the only available cue. In this work, we show how mid-level semantic attributes can be used synergistically with low-level features for both identification and re-identification. Specifically, we learn an attribute-centric representation to describe people, and a metric for comparing attribute profiles to disambiguate individuals. This differs from existing approaches to re-identification, which rely purely on bottom-up statistics of low-level features: it allows improved robustness to changes in viewpoint and lighting, and it can be used for identification as well as re-identification. Experiments on benchmark datasets demonstrate the flexibility and effectiveness of our approach compared to existing feature representations.
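As an illustration only, the idea of comparing attribute profiles under a metric can be sketched as follows. This is not the authors' method: it assumes each gallery person is summarised by a vector of per-attribute confidence scores, that a query built from a linguistic description is a binary attribute vector, and that the learned metric can be stood in for by a hypothetical per-attribute (diagonal) weighting. The attribute names, weights, and function names are all hypothetical. The same ranking applies to re-identification by using a probe image's attribute profile as the query.

```python
from typing import List, Sequence


def weighted_distance(a: Sequence[float], b: Sequence[float], w: Sequence[float]) -> float:
    """Diagonal-weighted squared Euclidean distance between two attribute profiles.

    The weights w are a hypothetical stand-in for a learned metric.
    """
    return sum(wi * (ai - bi) ** 2 for ai, bi, wi in zip(a, b, w))


def rank_gallery(query: Sequence[float],
                 gallery: Sequence[Sequence[float]],
                 weights: Sequence[float]) -> List[int]:
    """Return gallery indices ordered from closest to farthest from the query."""
    dists = [weighted_distance(query, g, weights) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: dists[i])


# Hypothetical attributes: [male, backpack, dark_jacket, jeans]
query = [1, 1, 0, 1]  # e.g. from "a man with a backpack, wearing jeans"
gallery = [
    [0.9, 0.2, 0.8, 0.1],  # hypothetical attribute-classifier scores per person
    [0.8, 0.9, 0.1, 0.7],
    [0.1, 0.7, 0.3, 0.9],
]
weights = [1.0, 2.0, 0.5, 1.0]  # hypothetical per-attribute importance

print(rank_gallery(query, gallery, weights))  # [1, 2, 0]
```

A diagonal weighting is the simplest choice here; a full learned metric would also model correlations between attributes.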