We propose a Convolutional Neural Network (CNN)-based model, "RotationNet," which takes multi-view images of an object as input and jointly estimates its pose and object category. Unlike previous approaches that use known viewpoint labels for training, our method treats the viewpoint labels as latent variables, which are learned in an unsupervised manner during training using an unaligned object dataset. RotationNet is designed to use only a partial set of multi-view images for inference, a property that makes it useful in practical scenarios where only partial views are available. Moreover, our pose alignment strategy enables one to obtain view-specific feature representations shared across classes, which is important for maintaining high accuracy in both object categorization and pose estimation. The effectiveness of RotationNet is demonstrated by its superior performance over state-of-the-art methods of 3D object classification on the 10- and 40-class ModelNet datasets. We also show that RotationNet, even when trained without known poses, achieves state-of-the-art performance on an object pose estimation dataset. The code is available at https://github.com/kanezaki/rotationnet
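The joint estimation of category and pose over latent viewpoints can be illustrated with a minimal sketch. The function name, array shapes, and the assumption that observed views occupy consecutive predefined viewpoints up to a cyclic shift are our own illustrative choices, not the authors' exact formulation: given per-view CNN outputs (a category distribution for each hypothesized viewpoint assignment), we search over candidate pose alignments and pick the category and rotation that maximize the joint likelihood.

```python
import numpy as np

def joint_estimate(view_probs):
    """Hypothetical RotationNet-style inference sketch.

    view_probs: array of shape (n_views, M, K), where
    view_probs[i, v, k] = P(category k | image i assumed to be taken
    from predefined viewpoint v). M is the number of predefined
    viewpoints, K the number of categories.

    Assumes the observed images correspond to consecutive viewpoints
    up to an unknown cyclic shift (the latent pose). Returns the
    (category, shift) pair maximizing the joint log-likelihood.
    """
    n_views, M, K = view_probs.shape
    best_score, best_cat, best_shift = -np.inf, None, None
    for shift in range(M):  # candidate pose: cyclic viewpoint alignment
        # Assign image i to viewpoint (shift + i) mod M and sum
        # per-view log-probabilities over categories.
        logp = sum(np.log(view_probs[i, (shift + i) % M, :] + 1e-12)
                   for i in range(n_views))  # shape (K,)
        k = int(np.argmax(logp))
        if logp[k] > best_score:
            best_score, best_cat, best_shift = logp[k], k, shift
    return best_cat, best_shift
```

Because the pose is marginalized by an explicit maximization over alignments, the same routine works with any partial subset of views, which mirrors the partial-view inference property described in the abstract.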