
Frame-to-Frame Aggregation of Active Regions in Web Videos for Weakly Supervised Semantic Segmentation



Abstract

When a deep neural network is trained on data with only image-level labels, the activated regions in each image tend to cover only a small part of the target object. We propose a method of using videos automatically harvested from the web to identify a larger region of the target object, exploiting temporal information that is not present in static images. The temporal variations in a video allow different regions of the target object to be activated. We obtain an activated region in each frame of a video and then aggregate the regions from successive frames into a single image, using a warping technique based on optical flow. The resulting localization maps cover more of the target object and can then be used as proxy ground truth to train a segmentation network. This simple approach outperforms existing methods under the same level of supervision, and even some approaches that rely on extra annotations. Based on VGG-16 and ResNet-101 backbones, our method achieves mIoU scores of 65.0 and 67.4, respectively, on PASCAL VOC 2012 test images, which represents a new state of the art.
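The core mechanism the abstract describes, warping per-frame activation maps into a common reference frame with optical flow and fusing them, can be illustrated with a minimal sketch. This is not the authors' code: the paper's abstract does not specify the flow estimator or the fusion rule, so Farneback optical flow from OpenCV and a pixelwise-max fusion are assumptions here, and `get_activation_map` is a hypothetical stand-in for a CAM-style classifier output.

```python
# Minimal sketch of frame-to-frame aggregation of activated regions.
# Assumptions (not from the paper): Farneback flow, pixelwise-max fusion,
# and a hypothetical `get_activation_map(frame) -> HxW float map`.
import cv2
import numpy as np

def warp_to_reference(act_map, flow):
    """Pull an activation map into reference coordinates via backward warping."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Each reference pixel (x, y) reads from (x + u, y + v) in the other frame.
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(act_map, map_x, map_y, interpolation=cv2.INTER_LINEAR)

def aggregate_active_regions(frames, get_activation_map, ref_idx=0):
    """Fuse per-frame activation maps into a single reference-frame map."""
    ref_gray = cv2.cvtColor(frames[ref_idx], cv2.COLOR_BGR2GRAY)
    fused = get_activation_map(frames[ref_idx]).astype(np.float32)
    for i, frame in enumerate(frames):
        if i == ref_idx:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense flow from the reference frame to frame i (for backward warping).
        flow = cv2.calcOpticalFlowFarneback(
            ref_gray, gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        warped = warp_to_reference(
            get_activation_map(frame).astype(np.float32), flow)
        # Union of activated regions: different frames light up
        # different object parts, so the fused map grows over time.
        fused = np.maximum(fused, warped)
    return fused
```

Under these assumptions, the fused map is simply the union of the warped per-frame activations, which matches the abstract's claim that temporal variation lets the aggregated localization map cover more of the object than any single frame's map.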
