Video description aims to automatically generate descriptive natural language for videos. Owing to the large volume of multi-modal data and the success of Deep Neural Networks (DNNs), a wide range of models have been proposed. However, previous models learn insufficient linguistic information or fail to adequately capture the correlation between the visual and textual modalities. To address these problems, this paper proposes an integrated model, VD-ivt, based on Long Short-Term Memory (LSTM). The proposed model consists of three parallel channels: a primary video description channel, a sentence-to-sentence channel for language learning, and a channel that integrates visual and textual information. The three parallel channels are connected through LSTM weight matrices during training. The VD-ivt model is evaluated on two publicly available datasets, YouTube2Text and LSMDC. Experimental results demonstrate that the proposed model outperforms the benchmark models.
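To make the three-channel layout concrete, the following is a minimal PyTorch sketch of one plausible reading of the abstract. The class name, all dimensions, the fusion strategy, and the hidden-weight-sharing step are assumptions for illustration only; the paper's actual VD-ivt wiring is not specified here and may differ.

```python
import torch
import torch.nn as nn

class VDivtSketch(nn.Module):
    """Illustrative three-channel layout inspired by the abstract.

    All names, dimensions, and the weight-sharing scheme below are
    assumptions, not the authors' published implementation.
    """

    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # Channel 1: primary video description channel (visual features -> states).
        self.video_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Channel 2: sentence-to-sentence channel for language learning.
        self.sent_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Channel 3: integration channel over concatenated visual + textual states.
        self.fusion_lstm = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)
        # Hypothetical coupling: the sentence channel reuses the video channel's
        # hidden-to-hidden weights -- one possible reading of "connected by
        # LSTM weight matrices during training".
        self.sent_lstm.weight_hh_l0 = self.video_lstm.weight_hh_l0

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim); captions: (batch, seq_len) of word ids
        vis_states, _ = self.video_lstm(frame_feats)
        txt_states, _ = self.sent_lstm(self.word_embed(captions))
        # Align the two streams to a common length before fusing (an assumption).
        t = min(vis_states.size(1), txt_states.size(1))
        fused, _ = self.fusion_lstm(
            torch.cat([vis_states[:, :t], txt_states[:, :t]], dim=-1)
        )
        return self.classifier(fused)  # per-step vocabulary logits
```

Under these assumptions, sharing only the hidden-to-hidden matrix lets the two encoder channels differ in input dimensionality (CNN frame features versus word embeddings) while still exchanging temporal dynamics, which is one way the language channel could regularize the description channel.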