This paper addresses the problem of supervised video summarization by formulating it as a sequence-to-sequence learning problem, where the input is a sequence of original video frames and the output is a keyshot sequence. Our key idea is to learn a deep summarization network with an attention mechanism that mimics the way humans select keyshots. To this end, we propose a novel video summarization framework named Attentive encoder-decoder networks for Video Summarization (AVS), in which the encoder uses a Bidirectional Long Short-Term Memory (BiLSTM) network to encode the contextual information among the input video frames. For the decoder, two attention-based LSTM networks are explored, using additive and multiplicative objective functions, respectively. Extensive experiments are conducted on three video summarization benchmark datasets, i.e., SumMe, TVSum, and YouTube. The results demonstrate the superiority of the proposed AVS-based approaches over the state-of-the-art approaches, with remarkable improvements ranging from 3% to 11% across the three datasets.
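As a rough illustration of the architecture described above, the following PyTorch sketch pairs a BiLSTM encoder with a decoder that supports both additive (Bahdanau-style) and multiplicative (Luong-style) attention scoring. All module names, dimensions, and the frame-importance output head are our own illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: layer sizes and the scoring head are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveEncoderDecoder(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, attention="additive"):
        super().__init__()
        # BiLSTM encoder over per-frame CNN features
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.decoder = nn.LSTM(2 * hidden, 2 * hidden, batch_first=True)
        self.attention = attention
        if attention == "additive":
            # Additive scoring: v^T tanh(W_h h_t + W_s s)
            self.W_h = nn.Linear(2 * hidden, hidden, bias=False)
            self.W_s = nn.Linear(2 * hidden, hidden, bias=False)
            self.v = nn.Linear(hidden, 1, bias=False)
        else:
            # Multiplicative scoring: s^T W h_t
            self.W = nn.Linear(2 * hidden, 2 * hidden, bias=False)
        self.out = nn.Linear(4 * hidden, 1)  # hypothetical importance head

    def score(self, h, s):
        # h: (B, T, 2H) encoder states; s: (B, 2H) current decoder state
        if self.attention == "additive":
            e = torch.tanh(self.W_h(h) + self.W_s(s).unsqueeze(1))
            return self.v(e).squeeze(-1)                     # (B, T)
        return torch.bmm(self.W(h), s.unsqueeze(-1)).squeeze(-1)

    def forward(self, frames):
        # frames: (B, T, feat_dim) per-frame features
        h, _ = self.encoder(frames)                          # (B, T, 2H)
        s = h.new_zeros(frames.size(0), h.size(-1))          # initial state
        state, scores = None, []
        for _ in range(frames.size(1)):
            alpha = F.softmax(self.score(h, s), dim=-1)      # attn weights
            ctx = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)  # context vec
            out, state = self.decoder(ctx.unsqueeze(1), state)
            s = out.squeeze(1)
            scores.append(self.out(torch.cat([s, ctx], dim=-1)))
        return torch.sigmoid(torch.cat(scores, dim=1))       # (B, T) scores

# Usage: importance = AttentiveEncoderDecoder()(torch.randn(2, 120, 1024))
```

In a keyshot-selection pipeline, the per-frame importance scores produced this way would typically be aggregated over temporal segments and the top-scoring shots selected under a summary-length budget.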