Journal: IEEE Transactions on Circuits and Systems for Video Technology

Fine-Grained Visual-Textual Representation Learning

Abstract

Fine-grained visual categorization aims to recognize hundreds of subcategories belonging to the same basic-level category, a highly challenging task because the visual distinctions among similar subcategories are subtle and local. Most existing methods learn part detectors to discover discriminative regions for better categorization performance. However, not all parts are beneficial or indispensable for categorization, and setting the number of part detectors relies heavily on prior knowledge and experimental validation. When we describe the object in an image with text, we mainly focus on its pivotal characteristics and rarely mention common characteristics or the background. This is an involuntary transfer from human visual attention to textual attention, so textual attention tells us how many and which parts are discriminative and significant for categorization. Textual attention can therefore help discover visual attention in the image. Inspired by this, we propose a fine-grained visual-textual representation learning (VTRL) approach with two main contributions: 1) fine-grained visual-textual pattern mining discovers discriminative visual-textual pairwise information by jointly modeling vision and text with generative adversarial networks, which automatically and adaptively discovers discriminative parts and boosts categorization performance; and 2) VTRL jointly combines visual and textual information, preserving intra-modality and inter-modality information to generate complementary fine-grained representations and further improve categorization performance. Comprehensive experiments on the widely used CUB-200-2011 and Oxford Flowers-102 datasets demonstrate the effectiveness of VTRL, which achieves the best categorization accuracy compared with state-of-the-art methods.
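
To make the joint visual-textual representation idea more concrete, below is a minimal, hypothetical PyTorch sketch of two encoders projecting image and text features into a shared space, trained with an intra-modality classification loss plus an inter-modality ranking loss. The module name VisualTextualEmbedder, the function joint_loss, the feature dimensions, and the loss formulation are illustrative assumptions rather than the paper's actual VTRL model; in particular, the GAN-based visual-textual pattern mining step is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualTextualEmbedder(nn.Module):
    def __init__(self, visual_dim=2048, text_dim=300, embed_dim=512, num_classes=200):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)   # e.g. pooled CNN features
        self.text_proj = nn.Linear(text_dim, embed_dim)       # e.g. averaged word vectors
        self.classifier = nn.Linear(embed_dim, num_classes)   # shared subcategory classifier

    def forward(self, visual_feat, text_feat):
        # Project both modalities into a common, L2-normalized embedding space.
        v = F.normalize(self.visual_proj(visual_feat), dim=1)
        t = F.normalize(self.text_proj(text_feat), dim=1)
        return v, t

def joint_loss(model, v_emb, t_emb, labels, margin=0.2):
    # Intra-modality: both embeddings should predict the correct subcategory.
    cls_loss = F.cross_entropy(model.classifier(v_emb), labels) + \
               F.cross_entropy(model.classifier(t_emb), labels)
    # Inter-modality: a matched image/text pair should score higher than
    # mismatched in-batch pairs by at least the margin (triplet-style ranking).
    sim = v_emb @ t_emb.t()                 # cosine similarities (embeddings are normalized)
    pos = sim.diag().unsqueeze(1)           # similarity of each matched pair
    hinge = F.relu(margin + sim - pos)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    rank_loss = hinge.masked_fill(mask, 0.0).mean()
    return cls_loss + rank_loss

# Usage with random stand-in features (real inputs would come from pretrained
# visual and text encoders applied to CUB-200-2011 images and their descriptions):
model = VisualTextualEmbedder()
visual = torch.randn(8, 2048)
text = torch.randn(8, 300)
labels = torch.randint(0, 200, (8,))
v_emb, t_emb = model(visual, text)
loss = joint_loss(model, v_emb, t_emb, labels)
loss.backward()

The classification term preserves intra-modality (subcategory) information, while the ranking term pulls matched image/text embeddings together and pushes mismatched ones apart, which is one common way to preserve inter-modality correspondence in a shared space.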