The rapid growth of video data increases the effort viewers must spend to find informative content. This paper presents an unsupervised video summarization framework based on contrastive learning that helps people extract the important parts of such videos. Contrastive learning typically employs anchor-positive and anchor-negative pairs to learn deep representations of the anchor. In our study, a positive sample is generated by reversing the anchor video; its summary should likewise be the reversed summary of the anchor. Meanwhile, an intra-negative sample is generated by destroying the temporal relations in the anchor video; its summary should differ substantially from the anchor's. Finally, we design our framework to exploit the similarities and differences between these samples and the anchor through two proposed summary losses. Experimental evaluations on two benchmark datasets show that our framework surpasses state-of-the-art unsupervised methods in terms of F-score and correlation coefficients. Without using any annotations, our method even outperforms many supervised methods. We also show that summarization performance can be further improved by training on large-scale external data collected from social networks. Quantitative experiments further show that our method can be integrated into other models, yielding better performance and faster convergence, which indicates the generality of the algorithm.
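The pair construction and summary losses described above can be sketched as follows. This is a minimal illustrative NumPy mock-up, not the paper's actual model: the toy temporal scorer, the squared-error form of the positive loss, the hinge form of the negative loss, and the margin value are all assumed stand-ins for the real network and the two proposed summary losses.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_frames(video, w):
    """Toy frame-importance scorer (hypothetical): each frame's score also
    depends on the previous frame, standing in for a real sequence model,
    so temporal order actually matters."""
    prev = np.vstack([video[:1], video[:-1]])      # previous-frame context
    logits = (video + 0.5 * prev) @ w
    return 1.0 / (1.0 + np.exp(-logits))           # per-frame scores in (0, 1)

# Anchor: T frames of D-dimensional features (random stand-ins here).
T, D = 8, 16
anchor = rng.normal(size=(T, D))
w = rng.normal(size=D)                             # assumed scorer weights

# Positive: temporally reversed anchor; its summary should match the
# reversed summary of the anchor.
positive = anchor[::-1]

# Intra-negative: frames shuffled to destroy temporal relations; its
# summary should differ from the anchor's.
negative = anchor[rng.permutation(T)]

s_anchor = score_frames(anchor, w)
s_pos = score_frames(positive, w)
s_neg = score_frames(negative, w)

# Positive summary loss: drive the positive's summary toward the
# reversed anchor summary.
loss_pos = np.mean((s_anchor[::-1] - s_pos) ** 2)

# Negative summary loss: hinge penalizing the negative's summary for
# being too close to the anchor's (margin is an assumed hyperparameter).
margin = 0.5
loss_neg = max(0.0, margin - np.mean((s_anchor - s_neg) ** 2))
```

In a real training loop these two losses would be backpropagated through the summarization network; here they are only computed once to show how the reversed-positive and shuffled-negative samples interact with the anchor's frame scores.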