An extraction model comprises: an encoder that extracts features from a first image in a first representation format to derive a feature map of the first image; a first decoder that derives, on the basis of the feature map, a second virtual image in a second representation format different from that of the first image; a first discriminator that discriminates the representation format of an input image and whether the input image is a real image or a virtual image, and outputs a first discrimination result; a second decoder that extracts a region of interest of the first image on the basis of the feature map; and a second discriminator that discriminates whether an extraction result of the region of interest by the second decoder is that of a first image with a ground-truth mask or that of a first image without a ground-truth mask, and outputs a second discrimination result.
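
The data flow among the five components described above can be sketched as follows. This is only an illustrative wiring diagram in code, not the claimed implementation: the class name `ExtractionModel`, the `forward` method, the output keys, and all `toy_*` functions are hypothetical, and the toy functions are trivial stand-ins for trained networks, whose architectures and training losses are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]       # 2-D grid of pixel intensities
FeatureMap = List[List[float]]  # toy stand-in for a learned feature map
Mask = List[List[float]]        # per-pixel region-of-interest score


@dataclass
class ExtractionModel:
    """Hypothetical container wiring the five components together."""
    encoder: Callable[[Image], FeatureMap]
    first_decoder: Callable[[FeatureMap], Image]
    # Returns (predicted representation format, predicted "is real image").
    first_discriminator: Callable[[Image], Tuple[str, bool]]
    second_decoder: Callable[[FeatureMap], Mask]
    # Returns a score that the mask came from an image WITH a ground-truth mask.
    second_discriminator: Callable[[Mask], float]

    def forward(self, first_image: Image) -> Dict[str, object]:
        # Both decoders share the single feature map from the encoder.
        feature_map = self.encoder(first_image)
        second_virtual_image = self.first_decoder(feature_map)
        roi_mask = self.second_decoder(feature_map)
        return {
            "second_virtual_image": second_virtual_image,
            "first_discrimination": self.first_discriminator(second_virtual_image),
            "roi_mask": roi_mask,
            "second_discrimination": self.second_discriminator(roi_mask),
        }


# --- Toy stand-ins (assumptions for illustration only) ---

def toy_encoder(image: Image) -> FeatureMap:
    return [[2.0 * p for p in row] for row in image]


def toy_first_decoder(fm: FeatureMap) -> Image:
    # Inverts intensities as a placeholder for a format change.
    return [[1.0 - p / 2.0 for p in row] for row in fm]


def toy_first_discriminator(image: Image) -> Tuple[str, bool]:
    # A trained discriminator would predict these; here they are fixed.
    return ("second", False)


def toy_second_decoder(fm: FeatureMap) -> Mask:
    return [[1.0 if p > 1.0 else 0.0 for p in row] for row in fm]


def toy_second_discriminator(mask: Mask) -> float:
    # Share of pixels that are confidently 0 or 1 (placeholder score).
    pixels = [p for row in mask for p in row]
    return sum(1 for p in pixels if p <= 0.1 or p >= 0.9) / len(pixels)


model = ExtractionModel(
    encoder=toy_encoder,
    first_decoder=toy_first_decoder,
    first_discriminator=toy_first_discriminator,
    second_decoder=toy_second_decoder,
    second_discriminator=toy_second_discriminator,
)
outputs = model.forward([[0.1, 0.9], [0.7, 0.1]])
```

In this sketch the first discriminator is applied only to the generated second virtual image; in training it would presumably also receive real images, and the second discriminator would receive extraction results for images both with and without ground-truth masks.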