IEEE Transactions on Pattern Analysis and Machine Intelligence

Revisiting Video Saliency Prediction in the Deep Learning Era


Abstract

Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively little effort has been devoted to understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during free-viewing of dynamic scenes, which addresses a long-standing need in this field. DHF1K consists of 1K high-quality, elaborately selected video sequences annotated by 17 observers using an eye tracker. The videos span a wide range of scenes, motions, object types, and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), which augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. This design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF Sports. An attribute-based analysis of previous saliency models and a cross-dataset generalization study are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and runs at a fast processing speed (40 fps on a single GPU). Our code and all results are available at https://github.com/wenguanwang/DHF1K.
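The abstract outlines the core architectural idea: per-frame CNN features are modulated by a static-saliency attention map (supervised with static fixation data) before being fed to an LSTM that models temporal saliency across frames. Below is a minimal, hypothetical PyTorch sketch of that idea. The layer sizes, module names (ConvLSTMCell, AttentiveCNNLSTM), and the toy convolutional LSTM cell are illustrative assumptions only and do not reproduce the authors' ACLNet; the official code is in the linked repository.

```python
# Hypothetical sketch of an attentive CNN-LSTM for video saliency.
# All dimensions and module names are assumptions for illustration.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """A simple convolutional LSTM cell operating on feature maps."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class AttentiveCNNLSTM(nn.Module):
    """CNN features -> static-attention map -> ConvLSTM over successive frames."""
    def __init__(self, feat_ch=64, hid_ch=64):
        super().__init__()
        self.cnn = nn.Sequential(                      # per-frame feature extractor
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
        self.attn = nn.Conv2d(feat_ch, 1, 1)           # static saliency / attention head
        self.lstm = ConvLSTMCell(feat_ch, hid_ch)
        self.head = nn.Conv2d(hid_ch, 1, 1)            # dynamic saliency head

    def forward(self, clip):                           # clip: (B, T, 3, H, W)
        B, T, _, H, W = clip.shape
        h = clip.new_zeros(B, self.lstm.hid_ch, H, W)
        c = torch.zeros_like(h)
        static_maps, dynamic_maps = [], []
        for t in range(T):
            feat = self.cnn(clip[:, t])
            a = torch.sigmoid(self.attn(feat))         # would be supervised with static fixations
            h, c = self.lstm(feat * a, (h, c))         # attention-modulated features into the LSTM
            static_maps.append(a)
            dynamic_maps.append(torch.sigmoid(self.head(h)))
        return torch.stack(static_maps, 1), torch.stack(dynamic_maps, 1)


if __name__ == "__main__":
    model = AttentiveCNNLSTM()
    s, d = model(torch.randn(2, 4, 3, 64, 64))         # 2 clips of 4 frames
    print(s.shape, d.shape)                            # (2, 4, 1, 64, 64) each
```

In such a sketch, the static attention head could be trained on existing large-scale static fixation datasets while the dynamic head is trained on video fixations, which mirrors the design rationale given in the abstract: the attention branch absorbs static saliency cues so the recurrent part can concentrate on temporal dynamics.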
