
vtGraphNet: Learning weakly-supervised scene graph for complex visual grounding



Abstract

As a challenging cross-modal task, visual grounding is currently addressed by directly analyzing the unstructured scene and matching the query text against all region proposals, which is prone to errors, especially when the scene and/or the query text are complex. In this paper, we study this complex visual grounding problem and propose to build a query-dependent visual-textual (VT) scene graph to jointly understand the image and the query text. To avoid the difficulty of obtaining ground-truth scene graphs, we propose vtGraphNet to effectively learn the bi-modal scene graph in a weakly-supervised way, where the only supervision is the manually annotated grounding region. Specifically, we first use an ARU tagging model to sequentially tag every query word as an attribute, a relationship, or an auxiliary word. If a word is tagged as an attribute, we develop an attribute-assigning model to associate it with a region proposal. If a word is tagged as a relationship, we develop a relationship-referring model to associate it with a pair of region proposals. A simple yet effective graph consistency loss function constrains these associations to form a feasible, compact VT scene graph, from which discriminative region features can be extracted and used to locate the grounded object by classification. Extensive experiments on benchmark datasets validate the superiority of our approach in handling both simple and complex visual grounding tasks. (C) 2020 Elsevier B.V. All rights reserved.
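To make the pipeline concrete, below is a minimal PyTorch sketch of the three components the abstract names: an ARU tagger over query words, an attribute-assigning scorer over single region proposals, and a relationship-referring scorer over ordered proposal pairs. All module structures, feature dimensions, and scoring functions here are illustrative assumptions, not the paper's published implementation.

```python
import torch
import torch.nn as nn

ATTRIBUTE, RELATIONSHIP, AUXILIARY = 0, 1, 2  # the ARU tag set

class ARUTagger(nn.Module):
    """Sequentially tags each query word as attribute / relationship / auxiliary."""
    def __init__(self, word_dim=300, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(word_dim, hidden, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden, 3)

    def forward(self, word_embs):                  # word_embs: (B, T, word_dim)
        h, _ = self.rnn(word_embs)
        return self.cls(h)                         # (B, T, 3) tag logits

class AttributeAssigner(nn.Module):
    """Scores how well an attribute word matches each single region proposal."""
    def __init__(self, word_dim=300, region_dim=2048, hidden=256):
        super().__init__()
        self.w = nn.Linear(word_dim, hidden)
        self.r = nn.Linear(region_dim, hidden)

    def forward(self, word, regions):              # word: (word_dim,), regions: (N, region_dim)
        return self.r(regions) @ self.w(word)      # (N,) association scores

class RelationshipReferrer(nn.Module):
    """Scores a relationship word against ordered (subject, object) proposal pairs."""
    def __init__(self, word_dim=300, region_dim=2048, hidden=256):
        super().__init__()
        self.w = nn.Linear(word_dim, hidden)
        self.p = nn.Linear(2 * region_dim, hidden)

    def forward(self, word, regions):              # word: (word_dim,), regions: (N, region_dim)
        n = regions.size(0)
        subj = regions.unsqueeze(1).expand(n, n, -1)   # row i = subject proposal i
        obj = regions.unsqueeze(0).expand(n, n, -1)    # column j = object proposal j
        return self.p(torch.cat([subj, obj], dim=-1)) @ self.w(word)  # (N, N) pair scores
```

In this sketch, each attribute word votes for one proposal and each relationship word votes for a (subject, object) pair; a graph consistency objective (sketched next) would tie these votes together into one compact VT scene graph.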
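The abstract does not spell out the graph consistency loss, so the sketch below substitutes a simple agreement penalty of my own: each tagged word's association scores induce a belief over proposals, and the loss pushes all beliefs toward a shared consensus so that the per-word associations assemble into a single compact graph around the grounded object. The KL-based formulation is an assumption, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def graph_consistency_loss(attr_scores, rel_scores):
    """attr_scores: list of (N,) tensors, one per attribute word.
    rel_scores:  list of (N, N) tensors, one per relationship word.
    Returns a scalar penalizing disagreement among the words' beliefs."""
    beliefs = [F.softmax(s, dim=0) for s in attr_scores]
    # For a relationship word, marginalize over objects to get a subject belief.
    beliefs += [F.softmax(s.sum(dim=1), dim=0) for s in rel_scores]
    consensus = torch.stack(beliefs).mean(dim=0)
    # Average KL divergence of each word's belief from the consensus belief.
    return sum(F.kl_div(b.log(), consensus, reduction="sum")
               for b in beliefs) / len(beliefs)
```

Under weak supervision, only the annotated grounding region provides a training signal; the consensus belief (or, as in the abstract, a classifier over the consistency-constrained region features) then selects the grounded proposal.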

Bibliographic Information

  • Source
    Neurocomputing | 2020, No. 6 | pp. 51-60 | 10 pages
  • Authors

    Lyu Fan; Feng Wei; Wang Song;

  • Author Affiliations

    Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China;

    Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China;

    Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China | Univ South Carolina, Comp Sci & Engn, Columbia SC 29208 USA;

  • Indexed In: Science Citation Index (SCI); Engineering Index (EI)
  • Original Format: PDF
  • Language: English
  • Keywords

    Visual grounding; Referring expression comprehension;

  • Date Added: 2022-08-18 22:26:49
