International Journal of Multimedia Information Retrieval

On-the-fly learning for visual search of large-scale image and video datasets

Abstract

The objective of this work is to visually search large-scale video datasets for semantic entities specified by a text query. The paradigm we explore is constructing visual models for such semantic entities on-the-fly, i.e. at run time, by using an image search engine to source visual training data for the text query. The approach combines fast and accurate learning and retrieval, and enables videos to be returned within seconds of specifying a query. We describe three classes of queries, each with its associated visual search method: object instances (using a bag of visual words approach for matching); object categories (using a discriminative classifier for ranking key frames); and faces (using a discriminative classifier for ranking face tracks). We discuss the features suitable for each class of query, for example Fisher vectors or features derived from convolutional neural networks (CNNs), and how these choices impact on the trade-off between three important performance measures for a real-time system of this kind, namely: (1) accuracy, (2) memory footprint, and (3) speed. We also discuss and compare a number of important implementation issues, such as how to remove ‘outliers’ in the downloaded images efficiently, and how to best obtain a single descriptor for a face track. We also sketch the architecture of the real-time on-the-fly system. Quantitative results are given on a number of large-scale image and video benchmarks (e.g. TRECVID INS, MIRFLICKR-1M), and we further demonstrate the performance and real-world applicability of our methods over a dataset sourced from 10,000 hours of unedited footage from BBC News, comprising 5M+ key frames.
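As a rough illustration of the pipeline the abstract describes (images downloaded for the text query plus a fixed negative pool train a linear classifier at run time, which then ranks corpus key frames by score), the following is a minimal Python sketch. All names (on_the_fly_search, prune_outliers, face_track_descriptor, the pre-computed feature arrays) are hypothetical and assume features have already been extracted, e.g. with a pre-trained CNN; this is an assumption-laden sketch, not the authors' implementation.

# Minimal sketch of on-the-fly category search over pre-computed
# CNN (or Fisher vector) features; hypothetical names, not the authors' code.
import numpy as np
from sklearn.svm import LinearSVC

def on_the_fly_search(query_feats, negative_pool, keyframe_feats, top_k=100):
    # Positives: features of images downloaded for the text query;
    # negatives: a fixed, query-independent pool (a common choice).
    X = np.vstack([query_feats, negative_pool])
    y = np.concatenate([np.ones(len(query_feats)),
                        np.zeros(len(negative_pool))])
    clf = LinearSVC(C=1.0).fit(X, y)
    # Rank every key frame in the corpus by its linear classifier score.
    scores = keyframe_feats @ clf.coef_.ravel() + clf.intercept_[0]
    return np.argsort(-scores)[:top_k]

def prune_outliers(query_feats, keep_frac=0.8):
    # One simple way to drop 'outliers' among the downloaded images:
    # keep those closest to the mean query descriptor (an assumption here;
    # the paper compares dedicated strategies for this step).
    mean = query_feats.mean(axis=0)
    dists = np.linalg.norm(query_feats - mean, axis=1)
    keep = np.argsort(dists)[: int(len(query_feats) * keep_frac)]
    return query_feats[keep]

def face_track_descriptor(per_frame_descs):
    # A single descriptor for a face track via mean-pooling plus L2
    # normalisation (one plausible option among those the abstract alludes to).
    d = per_frame_descs.mean(axis=0)
    return d / (np.linalg.norm(d) + 1e-12)

In this sketch, ranking reduces to a single matrix-vector product against the stored key-frame features, which is what makes returning results within seconds of a query feasible; the memory footprint and accuracy then hinge on the choice and dimensionality of those features, the trade-off the abstract highlights.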