In automatic video summarization, visual summary is constructed typically based on the analysis of low-level features with little consideration of video semantics. However, the contextual and semantic information of a video is marginally related to low-level features in practice although they are useful to compute visual similarity between frames. Therefore, we propose a novel video summarization technique, where the semantically important information is extracted from a set of keyframes given by human and the summary of a video is constructed based on the automatic temporal segmentation using the analysis of inter-frame similarity to the keyframes. Toward this goal, we model a video sequence with a dissimilarity matrix based on bidirectional similarity measure between every pair of frames, and subsequently characterize the structure of the video by a nonlinear manifold embedding. Then, we formulate video summarization as a variant of the 0–1 knapsack problem, which is solved by dynamic programming efficiently. The effectiveness of our algorithm is illustrated quantitatively and qualitatively using realistic videos collected from YouTube.
展开▼