IEEE High Performance Extreme Computing Conference

Too many secants: a hierarchical approach to secant-based dimensionality reduction on large data sets


Abstract

A fundamental question in many data analysis settings is how to discern the “natural” dimension of a data set. That is, when a data set is drawn from a manifold (possibly with noise), a meaningful aspect of the data is the dimension of that manifold. Various approaches exist for estimating this dimension, such as the method of Secant-Avoidance Projection (SAP). Intuitively, the SAP algorithm seeks a projection that best preserves the lengths of all secants between points in a data set; by applying the algorithm to find the best projections onto vector spaces of various dimensions, one may infer the dimension of the manifold from which the data originate. That is, one may learn the dimension at which it is possible to construct a diffeomorphic copy of the data in a lower-dimensional Euclidean space. Using Whitney's embedding theorem, we can relate this information to the natural dimension of the data. A drawback of the SAP algorithm is that a data set with T points has O(T^2) secants, making the computation and storage of all secants infeasible for very large data sets. In this paper, we propose a novel algorithm that generalizes the SAP algorithm with an emphasis on addressing this issue. Specifically, we propose a hierarchical secant-based dimensionality-reduction method, which can be employed for data sets where explicitly calculating all secants is not feasible.
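To make the secant idea and its quadratic cost concrete, the following is a minimal illustrative sketch, not the SAP algorithm and not the hierarchical method proposed in the paper. It forms all T(T-1)/2 unit secants explicitly and, as a stand-in for SAP's optimized projection, projects them onto the top-k right singular vectors of the secant matrix. The function names `unit_secants` and `min_projected_secant_norm` and the circle test data are assumptions made for illustration only.

```python
import numpy as np

def unit_secants(X):
    """Return all T*(T-1)/2 normalized secants between rows of X (assumes distinct points)."""
    i, j = np.triu_indices(len(X), k=1)      # every unordered pair of points
    S = X[i] - X[j]
    return S / np.linalg.norm(S, axis=1, keepdims=True)

def min_projected_secant_norm(X, k):
    """Smallest length of any unit secant after projecting to k dimensions.

    Stand-in projection (an assumption, NOT SAP): the top-k right singular
    vectors of the secant matrix. A value near 1 means every secant keeps
    its length; a value near 0 means some secant is nearly collapsed.
    """
    S = unit_secants(X)
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    P = Vt[:k].T                              # (d, k) with orthonormal columns
    return np.linalg.norm(S @ P, axis=1).min()

# Hypothetical test data: a circle (1-D manifold) rotated into R^10.
rng = np.random.default_rng(0)
T = 200
t = rng.uniform(0.0, 2.0 * np.pi, T)
X = np.zeros((T, 10))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))  # random orthogonal rotation
X = X @ Q

print("number of secants:", len(unit_secants(X)))  # grows as O(T^2)
for k in (1, 2, 3):
    print(k, min_projected_secant_norm(X, k))
```

In this toy case the worst-preserved secant is nearly collapsed for k = 1 and fully preserved from k = 2 onward, which matches the fact that a circle embeds in two dimensions. The explicit secant matrix has T(T-1)/2 rows, which is exactly the storage cost the abstract identifies as infeasible for very large T.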
