首页> 外文会议>IEEE International Conference on Cluster Computing >Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data
【24h】

Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data

机译:高维数据的快速可扩展近似最近邻居搜索

获取原文

摘要

K-Nearest Neighbor (k-NN) search is one of the most commonly used approaches for similarity search. It finds extensive applications in machine learning and data mining. This era of big data warrants efficiently scaling k-NN search algorithms for billion-scale datasets with high dimensionality. In this paper, we propose a solution towards this end where we use vantage point trees for partitioning the dataset across multiple processes and exploit an existing graph-based sequential approximate k-NN search algorithm called HNSW (Hierarchical Navigable Small World) for searching locally within a process. Our hybrid MPI-OpenMP solution employs techniques including exploiting MPI one-sided communication for reducing communication times and partition replication for better load balancing across processes. We demonstrate computation of k-NN for 10,000 queries in the order of seconds using our approach on ∼8000 cores on a dataset with billion points in an 128-dimensional space. We also show 10X speedup over a completely k-d tree-based solution for the same dataset, thus demonstrating better suitability of our solution for high dimensional datasets. Our solution shows almost linear strong scaling,
机译:K最近邻(k-NN)搜索是最常用的相似性搜索方法之一。它在机器学习和数据挖掘中找到了广泛的应用。大数据时代确保了对数十亿规模的高维度数据集有效地缩放k-NN搜索算法。在本文中,我们为此目的提出了一种解决方案,其中我们使用优势点树在多个过程之间划分数据集,并利用一种称为HNSW(Hierarchical Navigable Small World)的现有基于图的顺序近似k-NN搜索算法在本地进行搜索一个过程。我们的混合MPI-OpenMP解决方案采用的技术包括利用MPI单侧通信来减少通信时间,并使用分区复制来更好地跨进程进行负载平衡。我们使用在128维空间中具有十亿个点的数据集上的8000个核心上的方法,以秒为单位演示了10,000个查询的k-NN计算。对于相同的数据集,我们还展示了基于完全k-d树的解决方案的10倍加速,从而证明了我们的解决方案对高维数据集的更好适用性。我们的解决方案显示出几乎线性的强缩放,

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号