Learning semantically disentangled representations is important for various computer vision tasks, such as image generation and classification. Although effective representations can be learned in supervised settings, many problems require enormous effort for data collection and labeling, and continuously changing events such as facial expressions are particularly difficult to label. In this paper, we propose a method for separating the latent representation of facial images into identity factors and facial expression factors within the variational autoencoder (VAE) framework. Our method uses only subject labels to control training and does not rely on expression-related annotations such as emotion labels. Separating the extracted facial expression factors from identity features is useful both for controlling image generation and for classifying facial expressions. Building on this latent representation, we also propose a new approach to facial expression recognition based on a simple clustering method that uses Euclidean distance, which dramatically reduces labeling cost. Experimental results show that our method successfully disentangles the representation of facial images, separating it into identity and facial expression factors. Moreover, on a facial expression recognition task, our approach shows advantages over the baseline method without using expression supervision.