CNN-based representations have greatly advanced the state of the art in visual recognition, but the community has primarily focused on the setting where the training and test sets belong to the same dataset/distribution. Models trained on one dataset, however, do not generalize well to other datasets [3,5]. Prior research shows that human vision, which is robust to data/domain shifts, relies on shape in addition to texture/appearance, whereas prior work in computer vision shows that CNN representations are biased towards texture [4]. We propose a new shape-based representation that captures the medial axis transform and skeleton of an object. As shown in Fig. 1, shape is more robust to domain shifts than texture. We apply our representation in the domain generalization (DG) setting: methods are trained on a set of source domains and tested on a disjoint domain from which no data is available at training time. Unlike related prior work on shape [7,8], which primarily targeted cross-modal retrieval and scene classification, our representation is denser than an edge map.
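To make the medial axis transform concrete, the following is a minimal sketch using scikit-image's `medial_axis`, which returns an object skeleton together with each pixel's distance to the nearest boundary. The function name `shape_representation` and the binary-mask input are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from skimage.morphology import medial_axis

def shape_representation(binary_mask: np.ndarray) -> np.ndarray:
    """Compute a medial-axis-based shape map from a binary object mask.

    The medial axis transform records, for each skeleton pixel, its
    distance to the nearest object boundary, so the result encodes
    local object thickness in addition to the skeleton itself.
    """
    skeleton, distance = medial_axis(binary_mask, return_distance=True)
    # Distance-weighted skeleton: nonzero only on the medial axis,
    # with values giving the radius of the maximal inscribed disk.
    return skeleton * distance
```

Because the output carries a thickness value at every skeleton pixel rather than a bare contour, it illustrates one sense in which a medial-axis representation is denser than a plain edge map.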