
Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition

Abstract

Systems based on bag-of-words models, operating on image features collected at the maxima of sparse interest-point operators, have been extremely successful for both computer-based visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in "saccade and fixate" regimes, the knowledge, methodology, and emphasis in the human and computer vision communities remain sharply distinct. Here, we make three contributions aimed at bridging this gap. First, we complement existing state-of-the-art large-scale dynamic computer vision datasets such as Hollywood-2 [1] and UCF Sports [2] with human eye movements collected under the ecological constraints of the visual action recognition task. To our knowledge, these are the first large-scale human eye-tracking datasets collected for video (497,107 frames, each viewed by 16 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic video stimuli, and (c) task control, as opposed to free viewing. Second, we introduce novel dynamic consistency and alignment models, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the massive amounts of collected data to pursue studies and to build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest-point sampling strategies and human fixations, and on their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted and, when used in an end-to-end automatic system leveraging some of the most advanced computer vision practice, can lead to state-of-the-art results.
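
To make the contrast between interest-point sampling and fixation-guided sampling concrete, below is a minimal, illustrative sketch, not the paper's implementation: a toy bag-of-words action recognition pipeline in which local descriptors are extracted at the maxima of a saliency map, standing in for predicted human fixations. All function names, the placeholder saliency map, and the random data are assumptions for illustration only.

```python
# Illustrative sketch only: bag-of-words action recognition with descriptors
# sampled at saliency maxima (a stand-in for predicted human fixations)
# instead of sparse interest-point maxima. Not the paper's implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def sample_at_saliency_maxima(saliency_map, num_points=50):
    """Return (row, col) coordinates of the strongest saliency responses."""
    flat = np.argsort(saliency_map, axis=None)[-num_points:]
    return np.column_stack(np.unravel_index(flat, saliency_map.shape))

def extract_descriptors(video_frames, points, patch=8):
    """Toy local descriptor: flattened pixel patch around each sampled point."""
    descs = []
    for frame in video_frames:
        padded = np.pad(frame, patch, mode="edge")
        for r, c in points:
            descs.append(padded[r:r + 2 * patch, c:c + 2 * patch].ravel())
    return np.array(descs)

def bow_histogram(descriptors, codebook):
    """Quantize descriptors against the codebook; return a normalized histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy usage: random arrays stand in for real video clips and action labels.
rng = np.random.default_rng(0)
videos = [rng.random((4, 64, 64)) for _ in range(20)]  # 20 clips, 4 frames each
labels = rng.integers(0, 2, size=20)                   # 2 action classes

all_descs, per_video = [], []
for clip in videos:
    saliency = clip.mean(axis=0)  # placeholder saliency map, one per clip
    pts = sample_at_saliency_maxima(saliency, num_points=20)
    d = extract_descriptors(clip, pts)
    per_video.append(d)
    all_descs.append(d)

# Build the visual codebook over all descriptors, then classify histograms.
codebook = KMeans(n_clusters=32, n_init=3, random_state=0).fit(np.vstack(all_descs))
X = np.array([bow_histogram(d, codebook) for d in per_video])
clf = LinearSVC().fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```

Swapping `sample_at_saliency_maxima` for a sparse interest-point detector reproduces the conventional pipeline the abstract refers to; the rest of the bag-of-words machinery is unchanged, which is what makes the two sampling strategies directly comparable.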
