IEEE Transactions on Pattern Analysis and Machine Intelligence

Hierarchical Scene Parsing by Weakly Supervised Learning with Image Descriptions

Abstract

This paper investigates a fundamental problem of scene understanding: how to parse a scene image into a structured configuration (i.e., a semantic object hierarchy with object interaction relations). We propose a deep architecture consisting of two networks: i) a convolutional neural network (CNN) that extracts the image representation for pixel-wise object labeling and ii) a recursive neural network (RsNN) that discovers the hierarchical object structure and the inter-object relations. Rather than relying on elaborate annotations (e.g., manually labeled semantic maps and relations), we train our deep model in a weakly supervised manner by leveraging the descriptive sentences of the training images. Specifically, we decompose each sentence into a semantic tree consisting of nouns and verb phrases, and apply these tree structures to discover the configurations of the training images. Once these scene configurations are determined, the parameters of both the CNN and the RsNN are updated accordingly by backpropagation. The entire model is trained with an Expectation-Maximization method. Extensive experiments show that our model produces meaningful scene configurations and achieves more favorable scene labeling results on two benchmarks (PASCAL VOC2012 and SYSU-Scenes) than other state-of-the-art weakly supervised deep learning methods. SYSU-Scenes, which we created to advance research on scene parsing, contains more than 5,000 scene images with their semantic sentence descriptions.
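
To make the idea of a semantic tree of nouns and verb phrases more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a binary tree whose leaves are object nouns and whose internal nodes are verb phrases relating their two children. The names `Node`, `leaves`, `relations`, and `compose`, the example sentence structure, and the toy feature vectors are all hypothetical. The `compose` function only imitates the bottom-up recursion of a recursive network: it uses a fixed tanh of the averaged child features where a real RsNN would apply learned weights.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """Leaf: an object noun. Internal node: a verb phrase relating its two children."""
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def leaves(node: Node) -> List[str]:
    """Return the object nouns under a node, left to right."""
    if node.is_leaf():
        return [node.label]
    return leaves(node.left) + leaves(node.right)


def relations(node: Node) -> List[Tuple[str, List[str], List[str]]]:
    """List (verb phrase, left objects, right objects) triples in post-order."""
    if node.is_leaf():
        return []
    here = (node.label, leaves(node.left), leaves(node.right))
    return relations(node.left) + relations(node.right) + [here]


def compose(node: Node, embed: Dict[str, List[float]]) -> List[float]:
    """Merge child features bottom-up.

    Assumption: tanh of the averaged child features stands in for the
    learned combination that a trained recursive network would apply.
    """
    if node.is_leaf():
        return embed[node.label]
    left_vec = compose(node.left, embed)
    right_vec = compose(node.right, embed)
    return [math.tanh(0.5 * (a + b)) for a, b in zip(left_vec, right_vec)]


# Hypothetical tree for a description such as "a man rides a horse near a car".
tree = Node("near", Node("ride", Node("man"), Node("horse")), Node("car"))

# Toy 2-d object features; in the paper's setting these would come from the CNN.
embed = {"man": [1.0, 0.0], "horse": [0.0, 1.0], "car": [0.5, 0.5]}

print(leaves(tree))
print(relations(tree))
print(compose(tree, embed))
```

In this sketch, `relations(tree)` lists the inner relation (`ride`) before the outer one (`near`), mirroring the order in which a bottom-up merge would build the hierarchy.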
