Journal of Vision

Statistics of spatial-temporal concatenations of features at human fixations in action classification



Abstract

Humans can detect, recognize, and classify a range of actions quickly. What are the spatial-temporal features and computations that underlie this ability? Global representations such as spatial-temporal volumes can be highly informative, but they depend on segmentation and tracking. Local representations such as histograms of optic flow lack descriptive power and require extensive training. Recently, we developed a model in which any human action is encoded by a spatial-temporal concatenation of natural action structures (NASs), i.e., sequences of structured patches in human actions at multiple spatial-temporal scales. We compiled NASs from videos of natural human actions, examined the statistics of NASs, selected a set of highly informative NASs, and used them as features for action classification. We found that the NASs obtained in this way achieved significantly better recognition performance than simple spatial-temporal features. To examine the extent to which this model accounts for human action understanding, we hypothesized that humans search for informative NASs in this task and performed visual psychophysical studies. We asked 12 subjects with normal vision to classify 500 videos of human actions while we tracked their fixations with an EyeLink II eye tracker. We examined the statistics of the NASs compiled at the recorded fixations and found that human observers' fixations were sparsely distributed and usually deployed to locations in space-time where concatenations of local features are informative. We selected a set of NASs compiled at the fixations and used them as features for action classification. We found that the classification accuracy is comparable to human performance and to that of the same model with automatically selected NASs. We concluded that encoding natural human actions in terms of NASs and their spatial-temporal concatenations accounts for aspects of human action understanding.
