IEEE Transactions on Vehicular Technology

Video Foreground Extraction Using Multi-View Receptive Field and Encoder–Decoder DCNN for Traffic and Surveillance Applications


Abstract

The automatic detection of foreground (FG) objects in videos is a demanding area of computer vision, with essential applications in video-based traffic analysis and surveillance. Recent solutions have attempted to exploit deep neural networks (DNNs) for this purpose. Unlike image segmentation, learning agents, i.e., features, for video FG object segmentation is nontrivial: it is a temporally processed decision-making problem in which the agents involved are the spatial and temporal correlations of the FG objects and the background (BG) of the scene. To handle this, and to overcome the poor delineation at the borders of FG regions that conventional DL models exhibit due to fixed-view receptive-field-based learning, this work introduces a Multi-view Receptive Field Encoder-Decoder Convolutional Neural Network, called MvRF-CNN. The main contribution of the model is harnessing multiple views of convolutional (conv) kernels with residual feature fusions at the early, mid, and late stages of an encoder-decoder (EnDec) architecture. This enhances the model's ability to learn condition-invariant agents, yielding more sharply delineated FG masks than existing approaches, from heuristic- to DL-based techniques. The model is trained with sequence-specific labeled samples to predict scene-specific pixel-level labels of FG objects in near-static scenes with minute dynamism. An experimental study on 37 video sequences from traffic and surveillance scenarios covering complex environments, viz. dynamic backgrounds, camera jitter, intermittent object motion, cast shadows, night videos, and bad weather, demonstrates the effectiveness of the model. The study covers two input configurations: a 3-channel (RGB) single frame and a 3-channel double frame with BG, in which two consecutive grayscale frames are stacked with a prior BG model. Ablation investigations are also conducted to show the importance of transfer learning (TL) and mid-fusion approaches for enhancing segmentation performance, and to probe the model's robustness in two failure modes: when manually annotated hard ground truths (HGT) are lacking, and when the model is tested on non-scene-specific videos. Overall, the model achieves a mean average performance of a $95\%$ figure-of-merit at 42 FPS.
