Fine-grained image classification aims to recognize hundreds of subcategories belonging to the same basic-level category, which is a highly challenging task due to the quite subtle visual distinctions among similar subcategories. Most existing methods learn part detectors to discover discriminative regions for better performance. However, not all localized parts are beneficial and indispensable for classification, and setting the number of part detectors relies heavily on prior knowledge as well as experimental results. When we describe the object in an image in text via natural language, we focus only on the pivotal characteristics and rarely pay attention to common characteristics or the background. This is an involuntary transfer from human visual attention to textual attention, which means that textual attention tells us how many and which parts are discriminative and significant. Therefore, the textual attention of natural language descriptions can help us discover visual attention in the image. Inspired by this, we propose a visual-textual attention driven fine-grained representation learning (VTA) approach, whose main contributions are: (1) Fine-grained visual-textual pattern mining discovers discriminative visual-textual pairwise information to boost classification by jointly modeling vision and text with generative adversarial networks (GANs), which automatically and adaptively discovers discriminative parts. (2) Visual-textual representation learning jointly combines visual and textual information, preserving the intra-modality and inter-modality information to generate a complementary fine-grained representation and further improve classification performance. Experiments on two widely-used datasets demonstrate the effectiveness of our VTA approach, which achieves the best classification accuracy.
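To make the second contribution concrete, below is a minimal, hypothetical sketch of the kind of joint visual-textual fusion the abstract describes: per-modality (intra-modality) classification losses, an inter-modality alignment term, and a fused representation for the final classifier. All module names, feature dimensions, and loss weights are illustrative assumptions and do not reproduce the authors' actual VTA architecture or GAN-based pattern mining.

```python
# Illustrative sketch only: intra-modality + inter-modality objectives over
# visual and textual features, plus a classifier on the fused representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointVisualTextualHead(nn.Module):
    def __init__(self, visual_dim=2048, textual_dim=512, embed_dim=256, num_classes=200):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)     # project visual (e.g., CNN part) features
        self.textual_proj = nn.Linear(textual_dim, embed_dim)   # project text-encoder features
        self.visual_cls = nn.Linear(embed_dim, num_classes)     # intra-modality (visual) classifier
        self.textual_cls = nn.Linear(embed_dim, num_classes)    # intra-modality (textual) classifier
        self.joint_cls = nn.Linear(2 * embed_dim, num_classes)  # classifier on the fused representation

    def forward(self, visual_feat, textual_feat, labels):
        v = F.relu(self.visual_proj(visual_feat))
        t = F.relu(self.textual_proj(textual_feat))

        # Intra-modality losses keep each modality discriminative on its own.
        loss_intra = F.cross_entropy(self.visual_cls(v), labels) + \
                     F.cross_entropy(self.textual_cls(t), labels)

        # Inter-modality loss pulls paired visual/textual embeddings together.
        loss_inter = 1.0 - F.cosine_similarity(v, t, dim=1).mean()

        # Complementary fused representation for the final prediction.
        joint_logits = self.joint_cls(torch.cat([v, t], dim=1))
        loss_joint = F.cross_entropy(joint_logits, labels)

        return loss_intra + loss_inter + loss_joint, joint_logits

# Example usage with random tensors (batch of 8, 200 subcategories).
if __name__ == "__main__":
    head = JointVisualTextualHead()
    v = torch.randn(8, 2048)
    t = torch.randn(8, 512)
    y = torch.randint(0, 200, (8,))
    loss, logits = head(v, t, y)
    print(loss.item(), logits.shape)
```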