IEEE/CVF Conference on Computer Vision and Pattern Recognition

Object Referring in Videos with Language and Human Gaze


Abstract

We investigate the problem of object referring (OR), i.e., localizing a target object in a visual scene given a language description. Humans perceive the world more as continuous video snippets than as static images, and describe objects not only by their appearance, but also by their spatio-temporal context and motion features. Humans also gaze at the object when they issue a referring expression. Existing works for OR mostly focus on static images only, which fall short of providing many such cues. This paper addresses OR in videos with language and human gaze. To that end, we present a new video dataset for OR, with 30,000 objects over 5,000 stereo video sequences annotated with their descriptions and gaze. We further propose a novel network model for OR in videos, integrating appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show that our method effectively utilizes motion cues, human gaze, and spatio-temporal context. Our method outperforms previous OR methods. For the dataset and code, please refer to https://people.ee.ethz.ch/~arunv/ORGaze.html.
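The abstract describes integrating appearance, motion, gaze, and spatio-temporal context into one network. Below is a minimal, hypothetical sketch of how such multi-cue fusion for scoring candidate object proposals against a language embedding might look; it is not the authors' architecture, and all module names, layer sizes, and the scoring scheme are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's model): fuse per-proposal
# appearance, motion, gaze, and context features with a language embedding,
# then score each proposal. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class MultiCueORSketch(nn.Module):
    def __init__(self, vis_dim=512, lang_dim=300, hidden=256):
        super().__init__()
        # One encoder per visual cue, mapping each feature into a shared space.
        self.appearance = nn.Linear(vis_dim, hidden)
        self.motion = nn.Linear(vis_dim, hidden)
        self.gaze = nn.Linear(vis_dim, hidden)
        self.context = nn.Linear(vis_dim, hidden)
        # Encoder for the referring-expression (language) embedding.
        self.language = nn.Linear(lang_dim, hidden)
        # Scoring head over the concatenated visual and language representations.
        self.score = nn.Sequential(
            nn.Linear(5 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, app, mot, gaze, ctx, lang):
        # app/mot/gaze/ctx: (num_proposals, vis_dim); lang: (lang_dim,)
        fused = torch.cat(
            [
                self.appearance(app),
                self.motion(mot),
                self.gaze(gaze),
                self.context(ctx),
                self.language(lang).expand(app.size(0), -1),
            ],
            dim=-1,
        )
        return self.score(fused).squeeze(-1)  # one score per proposal


if __name__ == "__main__":
    model = MultiCueORSketch()
    n = 8  # number of candidate object proposals
    scores = model(
        torch.randn(n, 512), torch.randn(n, 512),
        torch.randn(n, 512), torch.randn(n, 512), torch.randn(300)
    )
    print(scores.shape)  # torch.Size([8]); argmax selects the referred object
```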
