IEEE Transactions on Pattern Analysis and Machine Intelligence

Revisiting Video Saliency Prediction in the Deep Learning Era


Abstract

Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively little effort has been devoted to understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during free-viewing of dynamic scenes, which addresses a long-standing need in this field. DHF1K consists of 1K high-quality, elaborately selected video sequences annotated by 17 observers using an eye tracker. The videos span a wide range of scenes, motions, object types, and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), which augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. This design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF Sports. An attribute-based analysis of previous saliency models and a cross-dataset generalization study are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and runs at a fast processing speed (40 fps on a single GPU). Our code and all results are available at https://github.com/wenguanwang/DHF1K.
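The abstract outlines the core architectural idea: per-frame CNN features are modulated by a static-saliency attention map (supervised with static fixation data) before being fed to an LSTM that models temporal saliency across frames. Below is a minimal, hypothetical PyTorch sketch of that idea. The layer sizes, module names (ConvLSTMCell, AttentiveCNNLSTM), and the toy convolutional LSTM cell are illustrative assumptions only and do not reproduce the authors' ACLNet; the official code is in the linked repository.

```python
# Hypothetical sketch of an attentive CNN-LSTM for video saliency.
# All dimensions and module names are assumptions for illustration.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """A simple convolutional LSTM cell operating on feature maps."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class AttentiveCNNLSTM(nn.Module):
    """CNN features -> static-attention map -> ConvLSTM over successive frames."""
    def __init__(self, feat_ch=64, hid_ch=64):
        super().__init__()
        self.cnn = nn.Sequential(                      # per-frame feature extractor
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
        self.attn = nn.Conv2d(feat_ch, 1, 1)           # static saliency / attention head
        self.lstm = ConvLSTMCell(feat_ch, hid_ch)
        self.head = nn.Conv2d(hid_ch, 1, 1)            # dynamic saliency head

    def forward(self, clip):                           # clip: (B, T, 3, H, W)
        B, T, _, H, W = clip.shape
        h = clip.new_zeros(B, self.lstm.hid_ch, H, W)
        c = torch.zeros_like(h)
        static_maps, dynamic_maps = [], []
        for t in range(T):
            feat = self.cnn(clip[:, t])
            a = torch.sigmoid(self.attn(feat))         # would be supervised with static fixations
            h, c = self.lstm(feat * a, (h, c))         # attention-modulated features into the LSTM
            static_maps.append(a)
            dynamic_maps.append(torch.sigmoid(self.head(h)))
        return torch.stack(static_maps, 1), torch.stack(dynamic_maps, 1)


if __name__ == "__main__":
    model = AttentiveCNNLSTM()
    s, d = model(torch.randn(2, 4, 3, 64, 64))         # 2 clips of 4 frames
    print(s.shape, d.shape)                            # (2, 4, 1, 64, 64) each
```

In such a sketch, the static attention head could be trained on existing large-scale static fixation datasets while the dynamic head is trained on video fixations, which mirrors the design rationale given in the abstract: the attention branch absorbs static saliency cues so the recurrent part can concentrate on temporal dynamics.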
