This application is directed to retrieving a video moment based on a text description. An electronic device obtains video content and a text description associated with the video moment. The video content includes a plurality of video segments, and the text description includes one or more sentences. A plurality of visual features are extracted for the video segments of the video content, and one or more textual features are extracted for the one or more sentences in the text description. The visual features of the plurality of video segments and the textual features of the one or more sentences are combined to generate a plurality of alignment scores. Based on the alignment scores, the electronic device retrieves a subset of the video content from the video segments for the text description.
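The scoring and retrieval steps described above can be illustrated with a minimal sketch. It is not the claimed method: the abstract does not say how features are combined, so this example assumes cosine similarity as the alignment score, averaging across sentences, and a fixed threshold for selection. The function names (`cosine`, `alignment_scores`, `retrieve_moment`) and the `threshold` parameter are hypothetical, and the feature vectors are assumed to be already extracted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_scores(segment_features, sentence_features):
    """Return a matrix of scores: one row per segment, one column per sentence.

    Assumption: cosine similarity stands in for the unspecified
    feature-combination step.
    """
    return [
        [cosine(seg, sent) for sent in sentence_features]
        for seg in segment_features
    ]


def retrieve_moment(segment_features, sentence_features, threshold=0.5):
    """Return indices of segments whose mean alignment score meets the threshold.

    Assumption: averaging over sentences and thresholding are illustrative
    choices, not taken from the source.
    """
    scores = alignment_scores(segment_features, sentence_features)
    per_segment = [sum(row) / len(row) for row in scores]
    return [i for i, score in enumerate(per_segment) if score >= threshold]


# Toy example: three segments and one sentence in a 2-D feature space.
segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentences = [[1.0, 0.0]]
selected = retrieve_moment(segments, sentences)
```

In this toy input, the first and third segments point in roughly the same direction as the sentence vector, so they are the ones selected.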