
vtGraphNet: Learning weakly-supervised scene graph for complex visual grounding



Abstract

As a challenging cross-modal task, visual grounding is currently addressed by directly analyzing the unstructured scene and matching the query text against all region proposals, which is prone to errors, especially when the scene and/or the query text are complex. In this paper, we study this complex visual grounding problem and propose to build a query-dependent visual-textual (VT) scene graph to jointly understand the image and the query text. To avoid the difficulty of obtaining ground-truth scene graphs, we propose vtGraphNet to effectively learn the bi-modal scene graph in a weakly-supervised way, where the only supervision is the manually annotated grounding region. Specifically, we first use an ARU tagging model to sequentially tag every query word as an attribute, a relationship, or an auxiliary word. If a word is tagged as an attribute, we develop an attribute-assigning model to associate it with a region proposal. If a word is tagged as a relationship, we develop a relationship-referring model to associate it with a pair of region proposals. A simple yet effective graph consistency loss function constrains these associations to form a feasible, compact VT scene graph, from which discriminative region features can be extracted and used to locate the grounded object by classification. Extensive experiments on benchmark datasets validate the superiority of our approach in handling both simple and complex visual grounding tasks. (C) 2020 Elsevier B.V. All rights reserved.
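To make the pipeline concrete, below is a minimal PyTorch sketch of the three components the abstract names: an ARU tagger over query words, an attribute-assigning scorer over single region proposals, and a relationship-referring scorer over ordered proposal pairs. All module structures, feature dimensions, and scoring functions here are illustrative assumptions, not the paper's published implementation.

```python
import torch
import torch.nn as nn

ATTRIBUTE, RELATIONSHIP, AUXILIARY = 0, 1, 2  # the ARU tag set

class ARUTagger(nn.Module):
    """Sequentially tags each query word as attribute / relationship / auxiliary."""
    def __init__(self, word_dim=300, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(word_dim, hidden, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden, 3)

    def forward(self, word_embs):                  # word_embs: (B, T, word_dim)
        h, _ = self.rnn(word_embs)
        return self.cls(h)                         # (B, T, 3) tag logits

class AttributeAssigner(nn.Module):
    """Scores how well an attribute word matches each single region proposal."""
    def __init__(self, word_dim=300, region_dim=2048, hidden=256):
        super().__init__()
        self.w = nn.Linear(word_dim, hidden)
        self.r = nn.Linear(region_dim, hidden)

    def forward(self, word, regions):              # word: (word_dim,), regions: (N, region_dim)
        return self.r(regions) @ self.w(word)      # (N,) association scores

class RelationshipReferrer(nn.Module):
    """Scores a relationship word against ordered (subject, object) proposal pairs."""
    def __init__(self, word_dim=300, region_dim=2048, hidden=256):
        super().__init__()
        self.w = nn.Linear(word_dim, hidden)
        self.p = nn.Linear(2 * region_dim, hidden)

    def forward(self, word, regions):              # word: (word_dim,), regions: (N, region_dim)
        n = regions.size(0)
        subj = regions.unsqueeze(1).expand(n, n, -1)   # row i = subject proposal i
        obj = regions.unsqueeze(0).expand(n, n, -1)    # column j = object proposal j
        return self.p(torch.cat([subj, obj], dim=-1)) @ self.w(word)  # (N, N) pair scores
```

In this sketch, each attribute word votes for one proposal and each relationship word votes for a (subject, object) pair; a graph consistency objective (sketched next) would tie these votes together into one compact VT scene graph.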
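The abstract does not spell out the graph consistency loss, so the sketch below substitutes a simple agreement penalty of my own: each tagged word's association scores induce a belief over proposals, and the loss pushes all beliefs toward a shared consensus so that the per-word associations assemble into a single compact graph around the grounded object. The KL-based formulation is an assumption, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def graph_consistency_loss(attr_scores, rel_scores):
    """attr_scores: list of (N,) tensors, one per attribute word.
    rel_scores:  list of (N, N) tensors, one per relationship word.
    Returns a scalar penalizing disagreement among the words' beliefs."""
    beliefs = [F.softmax(s, dim=0) for s in attr_scores]
    # For a relationship word, marginalize over objects to get a subject belief.
    beliefs += [F.softmax(s.sum(dim=1), dim=0) for s in rel_scores]
    consensus = torch.stack(beliefs).mean(dim=0)
    # Average KL divergence of each word's belief from the consensus belief.
    return sum(F.kl_div(b.log(), consensus, reduction="sum")
               for b in beliefs) / len(beliefs)
```

Under weak supervision, only the annotated grounding region provides a training signal; the consensus belief (or, as in the abstract, a classifier over the consistency-constrained region features) then selects the grounded proposal.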

Bibliographic Information

  • Source
    Neurocomputing | 2020, No. 6 | pp. 51-60 | 10 pages
  • Authors

    Lyu Fan; Feng Wei; Wang Song;

  • Author Affiliations

    Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China;

    Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China;

    Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China | Univ South Carolina, Comp Sci & Engn, Columbia SC 29208 USA;

  • Indexed In: Science Citation Index (SCI); Engineering Index (EI)
  • Original Format: PDF
  • Language: English
  • Keywords

    Visual grounding; Referring expression comprehension;

  • Date Added: 2022-08-18 22:26:49
