The present disclosure relates to systems, non-transitory computer-readable media, and methods that generate ground truth annotations of target utterances in digital image editing dialogues in order to create a state-driven training data set. In particular, in one or more embodiments, the disclosed systems utilize machine-defined and user-defined tags, machine learning model predictions, and user input to generate a ground truth annotation that includes frame information in addition to intent, attribute, object, and/or location information. In at least one embodiment, the disclosed systems generate ground truth annotations in conformance with an annotation ontology, which results in fast and accurate annotation of digital image editing dialogues.
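By way of illustration only, the following minimal Python sketch shows one way a ground truth annotation of the kind described above might be represented and assembled. All names here (`GroundTruthAnnotation`, `build_annotation`, `SLOTS`, and the example label values) are hypothetical and do not come from the disclosure. Representing frame information as an integer identifier, and letting user input take precedence over model predictions, are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class GroundTruthAnnotation:
    """Hypothetical record for one annotated target utterance."""
    utterance: str
    frame: int                        # assumption: frame info as an integer id
    intent: Optional[str] = None
    attribute: Optional[str] = None
    obj: Optional[str] = None         # "object" slot; renamed to avoid the builtin
    location: Optional[str] = None


# Slots named in the abstract, other than the frame information.
SLOTS = ("intent", "attribute", "obj", "location")


def build_annotation(
    utterance: str,
    frame: int,
    predicted: Dict[str, str],
    confirmed: Dict[str, str],
) -> GroundTruthAnnotation:
    """Merge model predictions with user input.

    Assumption: a user-confirmed label overrides the prediction for the
    same slot; slots with neither source stay None.
    """
    merged = {slot: confirmed.get(slot, predicted.get(slot)) for slot in SLOTS}
    return GroundTruthAnnotation(utterance=utterance, frame=frame, **merged)


# Illustrative values only; the labels are not drawn from the disclosed ontology.
example = build_annotation(
    "make the sky brighter",
    frame=1,
    predicted={"intent": "adjust", "attribute": "brightness", "obj": "tree"},
    confirmed={"obj": "sky"},
)
```

In this sketch the user's correction of the object slot replaces the model's prediction, while the predicted intent and attribute are retained and the unspecified location slot is left empty.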