Embodiments relate to video action segmentation by mixed temporal domain adaption. A computer-implemented method is provided for training a video segmentation system that assigns a set of action labels to frames of a video. The method includes: inputting, for each input video from a first set of video data and from a second set of video data, a set of frame-level features of frames of the input video into a video segmentation network; outputting, for each input video from the first set of video data and from the second set of video data, a final set of frame-level predictions in which each frame of at least some of the frames from a set of frames of the input video has an associated label prediction; computing losses for the video segmentation network; and updating the video segmentation network using the computed losses.
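The four steps of the method (input frame-level features, output frame-level predictions, compute losses, update the network) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the claimed implementation: `FrameClassifier` is a hypothetical one-layer stand-in for the video segmentation network, the toy features and labels are invented, and the only loss computed is a frame-wise cross-entropy on the first set of video data, which is assumed here to carry labels. The abstract does not state which losses are computed or how the second set of video data enters them, so any domain adaption loss is omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class FrameClassifier:
    """Hypothetical stand-in for the video segmentation network (one linear layer)."""

    def __init__(self, feat_dim, num_labels, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(feat_dim)]
                  for _ in range(num_labels)]
        self.b = [0.0] * num_labels

    def forward(self, frames):
        """Map frame-level features to per-frame label probabilities."""
        return [softmax([sum(wj * xj for wj, xj in zip(row, x)) + b
                         for row, b in zip(self.w, self.b)])
                for x in frames]

    def predict(self, frames):
        """Frame-level predictions: the most probable label for each frame."""
        return [max(range(len(p)), key=p.__getitem__)
                for p in self.forward(frames)]


def train_step(net, frames, labels, lr=0.5):
    """Compute the mean frame-wise cross-entropy, apply one gradient step,
    and return the loss measured before the update."""
    probs = net.forward(frames)  # computed once, before any weight changes
    n = len(frames)
    loss = 0.0
    for x, y, p in zip(frames, labels, probs):
        loss -= math.log(p[y] + 1e-12) / n
        for c in range(len(p)):
            g = (p[c] - (1.0 if c == y else 0.0)) / n  # d(loss)/d(score_c)
            net.b[c] -= lr * g
            for j, xj in enumerate(x):
                net.w[c][j] -= lr * g * xj
    return loss


# Toy data invented for illustration: 2-D frame features, 2 action labels.
# Assumption: the first set is labeled, the second set is not.
first_set = [([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]], [0, 0, 1, 1])]
second_set = [[[0.8, 0.2], [0.2, 0.8]]]

net = FrameClassifier(feat_dim=2, num_labels=2)
losses = []
for _ in range(200):
    for frames, labels in first_set:
        losses.append(train_step(net, frames, labels))

# Videos from both sets pass through the same network to get predictions.
first_predictions = [net.predict(frames) for frames, _ in first_set]
second_predictions = [net.predict(frames) for frames in second_set]
```

In this sketch the second set only receives predictions. A fuller version would add whatever losses the embodiments define over both sets before the update step.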