IEEE/CVF Conference on Computer Vision and Pattern Recognition

Object Referring in Videos with Language and Human Gaze


Abstract

We investigate the problem of object referring (OR), i.e., localizing a target object in a visual scene given a language description. Humans perceive the world more as continuous video snippets than as static images, and describe objects not only by their appearance, but also by their spatio-temporal context and motion features. Humans also gaze at the object when they issue a referring expression. Existing works for OR mostly focus on static images only, which fall short of providing many such cues. This paper addresses OR in videos with language and human gaze. To that end, we present a new video dataset for OR, with 30,000 objects over 5,000 stereo video sequences annotated with their descriptions and gaze. We further propose a novel network model for OR in videos, integrating appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show that our method effectively utilizes motion cues, human gaze, and spatio-temporal context. Our method outperforms previous OR methods. For the dataset and code, please refer to https://people.ee.ethz.ch/~arunv/ORGaze.html.
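The abstract describes integrating appearance, motion, gaze, and spatio-temporal context into one network. Below is a minimal, hypothetical sketch of how such multi-cue fusion for scoring candidate object proposals against a language embedding might look; it is not the authors' architecture, and all module names, layer sizes, and the scoring scheme are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's model): fuse per-proposal
# appearance, motion, gaze, and context features with a language embedding,
# then score each proposal. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class MultiCueORSketch(nn.Module):
    def __init__(self, vis_dim=512, lang_dim=300, hidden=256):
        super().__init__()
        # One encoder per visual cue, mapping each feature into a shared space.
        self.appearance = nn.Linear(vis_dim, hidden)
        self.motion = nn.Linear(vis_dim, hidden)
        self.gaze = nn.Linear(vis_dim, hidden)
        self.context = nn.Linear(vis_dim, hidden)
        # Encoder for the referring-expression (language) embedding.
        self.language = nn.Linear(lang_dim, hidden)
        # Scoring head over the concatenated visual and language representations.
        self.score = nn.Sequential(
            nn.Linear(5 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, app, mot, gaze, ctx, lang):
        # app/mot/gaze/ctx: (num_proposals, vis_dim); lang: (lang_dim,)
        fused = torch.cat(
            [
                self.appearance(app),
                self.motion(mot),
                self.gaze(gaze),
                self.context(ctx),
                self.language(lang).expand(app.size(0), -1),
            ],
            dim=-1,
        )
        return self.score(fused).squeeze(-1)  # one score per proposal


if __name__ == "__main__":
    model = MultiCueORSketch()
    n = 8  # number of candidate object proposals
    scores = model(
        torch.randn(n, 512), torch.randn(n, 512),
        torch.randn(n, 512), torch.randn(n, 512), torch.randn(300)
    )
    print(scores.shape)  # torch.Size([8]); argmax selects the referred object
```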
