Video surveillance has become ubiquitous with the rapid development of artificial intelligence. Multi-object detection (MOD) is a key step in video surveillance and has long been widely studied. Most existing MOD algorithms follow a “divide and conquer” pipeline and use popular machine learning techniques to optimize algorithm parameters. However, this pipeline is usually suboptimal because it decomposes the MOD task into several sub-tasks and does not optimize them jointly. In addition, the commonly used supervised learning methods rely on labeled data, which are scarce and expensive to obtain. We therefore propose an end-to-end Unsupervised Multi-Object Detection framework for video surveillance, in which a neural model learns to detect objects in each video frame by minimizing the image reconstruction error. We also propose a Memory-Based Recurrent Attention Network to ease detection and training. The proposed model was evaluated on both synthetic and real datasets, demonstrating its potential.