首页> 外文学位 >Efficient computation of k-nearest neighbor graphs for large high-dimensional data sets on gpu clusters.
【24h】

Efficient computation of k-nearest neighbor graphs for large high-dimensional data sets on gpu clusters.

机译:有效计算gpu群集上的大型高维数据集的k最近邻图。

获取原文
获取原文并翻译 | 示例

摘要

The k-Nearest Neighbor Graph (k-NNG) and the related k-Nearest Neighbor (k-NN) methods have a wide variety of applications in areas such as bioinformatics, machine learning, data mining, clustering analysis, and pattern recognition. Our application of interest is manifold embedding. Due to the large dimensionality of the input data ( <15k), spatial subdivision based techniques such OBBs, k-d tree, BSP etc., are not viable. The only alternative is the brute-force search, which has two distinct parts. The first finds distances between individual vectors in the corpus based on a pre-defined metric. Given the distance matrix, the second step selects k nearest neighbors for each member of the query data set. This thesis presents the development and implementation of a distributed exact k-Nearest Neighbor Graph (k-NNG) construction method. The proposed method uses Graphics Processing Units (GPUs) and exploits multiple levels of parallelism for distributed computational systems using GPUs. It is scalable for different cluster sizes, with each compute node in the cluster containing multiple GPUs. The distance computation is formulated as a basic matrix multiplication and reduction operation. The optimized CUBLAS matrix multiplication library is used for this purpose. Various distance metrics such as Euclidian, cosine, and Pearson are supported. For k-NNG construction, two different methods are presented. The first is based on an approach called batch index sorting to build the k-NNG with three sorting operations. This method uses the optimized radix sort implementation in the Thrust library for GPU. The second is an efficient implementation using the latest GPU functionalities of a variant of the quick select algorithm. Overall, the batch index sorting based k-NNG method is approximately 13x faster than a distributed MATLAB implementation. The quick select algorithm itself has a 5x speedup over state-of-the art GPU methods. This has enabled the processing of k-NNG construction on a data set containing 20 million image vectors, each with dimension 15,000, as part of a manifold embedding technique for analyzing the conformations of biomolecules.
机译:k最近邻图(k-NNG)和相关k最近邻方法(k-NN)在生物信息学,机器学习,数据挖掘,聚类分析和模式识别等领域具有广泛的应用。我们感兴趣的应用是流形嵌入。由于输入数据的维度大(<15k),基于空间细分的技术(如OBB,k-d树,BSP等)不可行。唯一的选择是蛮力搜索,它有两个不同的部分。首先根据预先定义的指标找到语料库中各个向量之间的距离。给定距离矩阵,第二步为查询数据集的每个成员选择k个最近的邻居。本文提出了一种分布式精确k最近邻图(k-NNG)构造方法的开发与实现。所提出的方法使用图形处理单元(GPU),并为使用GPU的分布式计算系统开发了多层并行性。它可以针对不同的集群规模进行扩展,集群中的每个计算节点都包含多个GPU。距离计算公式化为基本矩阵乘法和约简运算。为此,使用了优化的CUBLAS矩阵乘法库。支持各种距离度量标准,例如欧几里得,余弦和皮尔逊。对于k-NNG构造,提出了两种不同的方法。第一种基于称为批索引排序的方法,可通过三个排序操作来构建k-NNG。该方法使用Thrust库中用于GPU的优化基数排序实现。第二个是使用快速选择算法变体的最新GPU功能的高效实现。总体而言,基于批处理索引排序的k-NNG方法比分布式MATLAB实现快约13倍。快速选择算法本身的速度是最先进的GPU方法的5倍。这是对包含2000万个图像矢量的数据集进行k-NNG构造的处理,每个图像矢量的尺寸为15,000,作为用于分析生物分子构象的多种嵌入技术的一部分。

著录项

  • 作者

    Dashti, Ali.;

  • 作者单位

    The University of Wisconsin - Milwaukee.;

  • 授予单位 The University of Wisconsin - Milwaukee.;
  • 学科 Biomedical engineering.;Computer science.
  • 学位 M.S.
  • 年度 2013
  • 页码 67 p.
  • 总页数 67
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号