A video stream processing method and system using deep learning-based image captioning are disclosed. According to an aspect of the present invention, a video stream processing method using deep learning-based image captioning includes: dividing, by the system, a video to be edited into a plurality of shots; generating, by the system, at least one shot text for each of the divided shots; receiving, by the system, a search condition text for image search; and selecting, by the system, a matching shot text corresponding to the search condition text, and extracting a matching image corresponding to the selected matching shot text. The generating, by the system, of at least one shot text for each of the divided shots includes: determining at least one selection frame from among a plurality of frames included in each of the divided shots; and generating the shot text through image captioning corresponding to the determined selection frame.
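The steps of the method can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the abstract does not specify how shots are detected, how the selection frame is chosen, which captioning model is used, or how shot text is matched to the search condition text. The sketch assumes, purely for illustration, that each frame is reduced to a small feature vector, that a shot boundary is a large difference between consecutive frames, that the middle frame of a shot serves as the selection frame, that the captioning model is supplied by the caller as a function, and that matching uses word overlap (Jaccard similarity). All names (`Shot`, `split_into_shots`, `select_frame`, `caption_shots`, `find_matching_shot`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Set

# Assumption: a frame is represented by a small feature vector, not raw pixels.
Frame = Sequence[float]


@dataclass
class Shot:
    start: int      # index of the first frame in the shot (inclusive)
    end: int        # index one past the last frame in the shot (exclusive)
    text: str = ""  # shot text produced by image captioning


def frame_distance(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two frames (placeholder metric)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / max(len(a), 1)


def split_into_shots(frames: List[Frame], threshold: float) -> List[Shot]:
    """Start a new shot when consecutive frames differ by more than threshold."""
    if not frames:
        return []
    shots: List[Shot] = []
    start = 0
    for i in range(1, len(frames)):
        if frame_distance(frames[i - 1], frames[i]) > threshold:
            shots.append(Shot(start, i))
            start = i
    shots.append(Shot(start, len(frames)))
    return shots


def select_frame(shot: Shot) -> int:
    """Pick the middle frame as the selection frame (an assumed heuristic)."""
    return (shot.start + shot.end) // 2


def caption_shots(
    frames: List[Frame],
    shots: List[Shot],
    captioner: Callable[[Frame], str],
) -> None:
    """Generate shot text by captioning each shot's selection frame.

    `captioner` stands in for a deep learning captioning model.
    """
    for shot in shots:
        shot.text = captioner(frames[select_frame(shot)])


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace-separated words of a text."""
    return set(text.lower().split())


def find_matching_shot(shots: List[Shot], query: str) -> Optional[Shot]:
    """Return the shot whose text best overlaps the search condition text.

    Uses Jaccard similarity on word sets; returns None if nothing overlaps.
    """
    q = tokens(query)
    best: Optional[Shot] = None
    best_score = 0.0
    for shot in shots:
        t = tokens(shot.text)
        union = q | t
        score = len(q & t) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = shot, score
    return best
```

In this sketch, the matching image is taken to be the frame range `frames[shot.start:shot.end]` of the shot returned by `find_matching_shot`. A real system would likely replace the word-overlap matching with a semantic text similarity model and the threshold-based splitting with a dedicated shot boundary detector.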