An electronic device obtains video content and a textual query associated with a video moment in the video content. The video content is divided into video segments, and the textual query includes one or more words. Visual features are extracted for each video segment, and textual features are extracted for each word. The visual features and the textual features are combined to generate a similarity matrix in which each element represents a similarity level between a respective video segment and a respective word. Segment-attended sentence features are generated for the textual query based on the textual features and the similarity matrix. The segment-attended sentence features are combined with the visual features of the video segments to determine a plurality of alignment scores, which are used to retrieve, from the video segments, a subset of the video content associated with the textual query.
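
The flow described above can be illustrated with a minimal sketch. The abstract does not specify the feature extractors, the similarity measure, the attention scheme, or the retrieval rule, so everything below is an assumption made for illustration: features are given as plain vectors of equal dimension, similarity is cosine similarity, the segment-attended sentence feature for a segment is a softmax-weighted sum of word features, the alignment score is the cosine similarity between a segment's visual feature and its attended sentence feature, and retrieval keeps the top-scoring segments. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity; returns 0.0 when either vector is all zeros.
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def softmax(row):
    # Numerically stable softmax over one row of the similarity matrix.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_matrix(visual, textual):
    # Element [i][j]: similarity between video segment i and word j
    # (cosine similarity is an assumption, not stated in the abstract).
    return [[cosine(v, w) for w in textual] for v in visual]

def segment_attended_sentence_features(textual, sim):
    # For each segment, weight the word features by that segment's
    # softmax-normalized similarities and sum them (assumed scheme).
    dim = len(textual[0])
    attended = []
    for row in sim:
        weights = softmax(row)
        attended.append([
            sum(wt * word[k] for wt, word in zip(weights, textual))
            for k in range(dim)
        ])
    return attended

def alignment_scores(visual, attended):
    # One score per segment: visual feature vs. its attended sentence feature.
    return [cosine(v, q) for v, q in zip(visual, attended)]

def retrieve(scores, top_k=1):
    # Indices of the top_k highest-scoring segments, in temporal order
    # (top-k selection is an assumed retrieval rule).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

# Toy inputs: three segments and two query words, each a 2-D feature vector.
visual = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
textual = [[1.0, 0.0], [0.9, 0.1]]

sim = similarity_matrix(visual, textual)
attended = segment_attended_sentence_features(textual, sim)
scores = alignment_scores(visual, attended)
best = retrieve(scores, top_k=1)
```

With these toy vectors, the first segment points in nearly the same direction as both query words, so it receives the highest alignment score and is the one retrieved. In a real system the vectors would come from learned visual and textual encoders, which this sketch does not attempt to model.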