
Large Scale Nearest Neighbor Search - Theories, Algorithms, and Applications



Abstract

We are witnessing a data explosion era, in which huge data sets of billions or more samples, each represented by a high-dimensional feature vector, can easily be found on the Web, in enterprise data centers, in surveillance sensor systems, and so on. On such large scale data sets, nearest neighbor search is fundamental to many applications, including content based search/retrieval, recommendation, clustering, graph and social network research, as well as many other machine learning and data mining problems. Exhaustive search is the simplest and most straightforward approach, but it cannot scale to data sets of the sizes mentioned above. To make large scale nearest neighbor search practical, the online search step must be sublinear in the database size, which means offline indexing is necessary. Moreover, achieving sublinear search time usually requires sacrificing some search accuracy, so we can often obtain only approximate rather than exact nearest neighbors. In other words, by large scale nearest neighbor search we mean approximate nearest neighbor search methods that achieve sublinear online search time via offline indexing.

To some extent, indexing a vector data set for (sublinear time) approximate search can be achieved by partitioning the feature space into different regions and mapping each point to its closest region(s). There are different kinds of partition structures, for example tree based, hashing based, and clustering/quantization based partitions. From the viewpoint of how the partition function is generated, partition methods fall into two main categories: (1) data independent (random) partitions, such as locality sensitive hashing and randomized trees/forests; (2) data dependent (optimized) partitions, such as compact hashing, quantization based indexing methods, and some tree based methods like the kd-tree and PCA tree. Given the offline indexing/partitioning, online approximate nearest neighbor search usually consists of three steps: locate the region the query point falls in, obtain as candidates the database points in regions near the query region, and rerank/return the candidates (a toy sketch of this pipeline is given below). For large scale nearest neighbor search, the key question is: how should the offline indexing be designed so that online search is as fast as possible while still meeting a required accuracy?

In this thesis, we study theories, algorithms, systems, and applications for (approximate) nearest neighbor search on large scale data sets, covering both indexing with random partitions and indexing with learned partitions.
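As a concrete illustration of the partition-based pipeline described above (a sketch written for this page, not code from the thesis), the following builds a data-independent index with random-hyperplane locality sensitive hashing and then runs the three online steps: locate the query's region, collect candidates from that region and its Hamming-distance-1 neighbors, and rerank candidates by exact distance. All class names and parameters are assumptions of the sketch.

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Toy data-independent (random) partition: sign-of-projection hashing."""

    def __init__(self, dim, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))  # random hyperplanes
        self.buckets = defaultdict(list)                  # region code -> point ids
        self.data = None

    def _code(self, x):
        # A region is identified by which side of each hyperplane x falls on.
        return tuple((self.planes @ x > 0).astype(np.int8))

    def index(self, data):
        # Offline indexing: map every database point to its region (bucket).
        self.data = np.asarray(data, dtype=np.float64)
        for i, x in enumerate(self.data):
            self.buckets[self._code(x)].append(i)

    def search(self, q, k=5):
        q = np.asarray(q, dtype=np.float64)
        code = self._code(q)  # step 1: locate the region the query falls in
        # Step 2: gather candidates from the query region and every region
        # within Hamming distance 1 of it (a simple multi-probe strategy).
        probes = [code] + [code[:b] + (1 - code[b],) + code[b + 1:]
                           for b in range(len(code))]
        cand = sorted({i for p in probes for i in self.buckets.get(p, [])})
        if not cand:
            return []
        # Step 3: rerank candidates by exact Euclidean distance.
        d = np.linalg.norm(self.data[cand] - q, axis=1)
        return [(cand[j], float(d[j])) for j in np.argsort(d)[:k]]

# Example: index 10k random 64-d vectors, then query a perturbed base point.
rng = np.random.default_rng(1)
base = rng.standard_normal((10_000, 64))
lsh = RandomHyperplaneLSH(dim=64)
lsh.index(base)
print(lsh.search(base[0] + 0.01 * rng.standard_normal(64), k=3))
```

Note that the search cost is driven by how many database points sit in the probed buckets, which is exactly the quantity the offline partition design has to control.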
Our main contributions are:

1. We unify various nearest neighbor search methods under a data partition framework, and provide a general formulation of the optimal data partition, which supports the fastest search speed while satisfying a required search accuracy. The formulation is general enough to explain most existing (sublinear) large scale approximate nearest neighbor search methods (one plausible way to write this objective is sketched after this list).
2. For indexing with data-independent partitions, we develop lower and upper bounds on time and space complexity, based on the optimal data partition formulation. The bounds apply to a general group of methods, which we call Nearest Neighbor Preferred Hashing and Nearest Neighbor Preferred Partition, including locality sensitive hashing, random forests, and many other random hashing methods. We further extend the theory to study how to choose the parameters of indexing methods with random partitions.
3. For indexing with data-dependent partitions, we apply the same formulation to develop a joint optimization approach with two important criteria: nearest neighbor preserving and region size balancing (a toy illustration of these two criteria also follows this list). We apply the joint optimization to different partition structures such as hashing and clustering, obtaining several new nearest neighbor search methods that outperform, or are at least comparable to, state-of-the-art solutions for large scale nearest neighbor search.
4. We further study fundamental problems of nearest neighbor search beyond search methods themselves: What is the difficulty of nearest neighbor search on a given data set, independent of the search method? Which data properties affect this difficulty, and how? How do the theoretical analysis and algorithm design of large scale nearest neighbor search change with the difficulty of the data set?
5. Finally, we apply our nearest neighbor search methods to practical applications, focusing on large scale visual search engines built with the new indexing methods developed in this thesis. The techniques carry over to other data-intensive domains and extend to applications beyond visual search, such as large scale machine learning, data mining, and social network analysis.
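Contribution 1 states the optimal partition formulation only in words. One plausible rendering of such an objective, in hypothetical notation chosen for this page rather than taken from the thesis, is:

```latex
% Hypothetical notation: P is the partition, R(q) the set of regions
% probed for query q, |r| the number of database points in region r,
% and NN(q) the true nearest neighbor of q.
\begin{aligned}
\min_{P}\quad & \mathbb{E}_{q}\!\left[\sum_{r \in R(q)} |r|\right]
  && \text{minimize the expected candidate count (search time)}\\
\text{s.t.}\quad & \Pr_{q}\!\left[\mathrm{NN}(q) \in \bigcup_{r \in R(q)} r\right] \ge 1 - \epsilon
  && \text{meet the required search accuracy}
\end{aligned}
```

Under this reading, data-independent methods draw P at random from a fixed family, while data-dependent methods optimize P directly on the data.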
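The two criteria in contribution 3 can likewise be illustrated with a toy data-dependent partition: k-means-style centroids tend to keep nearby points in the same region (nearest neighbor preserving), while a hard per-region capacity keeps region sizes balanced. This is a sketch under its own assumptions, not the thesis's algorithm.

```python
import numpy as np

def balanced_kmeans_partition(data, n_regions=8, n_iters=10, seed=0):
    """Toy data-dependent partition: k-means centroids (NN preserving)
    with a hard per-region capacity (region size balancing)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=np.float64)
    n = len(data)
    cap = -(-n // n_regions)  # ceil(n / n_regions): max points per region
    centers = data[rng.choice(n, n_regions, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n, n_regions).
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        assign = np.full(n, -1)
        sizes = np.zeros(n_regions, dtype=int)
        # Assign the most "decided" points first (largest gap between their
        # nearest centroid and the average), skipping regions already full.
        for i in np.argsort(d.min(axis=1) - d.mean(axis=1)):
            for r in np.argsort(d[i]):
                if sizes[r] < cap:
                    assign[i] = r
                    sizes[r] += 1
                    break
        # Update each centroid from the points assigned to it.
        for r in range(n_regions):
            pts = data[assign == r]
            if len(pts):
                centers[r] = pts.mean(axis=0)
    return assign, centers

# Example: region sizes stay capped at ceil(2000 / 8) = 250.
pts = np.random.default_rng(2).standard_normal((2_000, 32))
assign, centers = balanced_kmeans_partition(pts, n_regions=8)
print(np.bincount(assign))
```

At search time the same three online steps apply, with regions now defined by the learned centroids instead of random hyperplanes.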

Bibliographic record

  • Author: He Junfeng
  • Year: 2014
  • Format: PDF
  • Language: English
