A video editing method and system using deep learning-based image captioning are disclosed. According to an aspect of the present invention, the video editing method includes: generating, by an editing system, a frame text corresponding to each of a plurality of frame images included in a video to be edited; receiving, by the editing system, a selection command including at least one selection keyword; determining, by the editing system, a target frame text, which is a frame text corresponding to the selection command; and determining a target frame image corresponding to the determined target frame text.
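The selection steps can be illustrated with a minimal sketch. It assumes the frame texts have already been produced by some image captioning model, and it assumes that a frame text "corresponds" to the selection command when it contains at least one selection keyword as a case-insensitive substring; the abstract does not specify the matching rule. The names `Frame` and `select_target_frames` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    index: int  # position of the frame image in the video
    text: str   # frame text, assumed to come from a captioning model


def select_target_frames(frames, keywords):
    """Return the frames whose frame text contains any selection keyword.

    Matching is a case-insensitive substring test, which is an assumption
    made for this sketch and not a rule stated in the source.
    """
    wanted = [k.lower() for k in keywords]
    return [f for f in frames if any(k in f.text.lower() for k in wanted)]


# Hypothetical frame texts for a three-frame video.
frames = [
    Frame(0, "a man riding a bicycle on a street"),
    Frame(1, "a dog running on the beach"),
    Frame(2, "two dogs playing with a ball"),
]

# A selection command carrying the single selection keyword "dog".
targets = select_target_frames(frames, ["dog"])
print([f.index for f in targets])  # [1, 2]
```

The indices of the returned frames identify the target frame images, which an editor could then cut, keep, or export.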