Source: Scientific Reports

Accurate Estimation of the Intrinsic Dimension Using Graph Distances: Unraveling the Geometric Complexity of Datasets

Abstract

The collective behavior of a large number of degrees of freedom can often be described by a handful of variables. This observation justifies the use of dimensionality reduction approaches to model complex systems and motivates the search for a small set of relevant "collective" variables. Here, we analyze this issue by focusing on the optimal number of variables needed to capture the salient features of a generic dataset and develop a novel estimator for the intrinsic dimension (ID). By approximating geodesics with minimum distance paths on a graph, we analyze the distribution of pairwise distances around the maximum and exploit its dependency on the dimensionality to obtain an ID estimate. We show that the estimator does not depend on the shape of the intrinsic manifold and is highly accurate, even for exceedingly small sample sizes. We apply the method to several relevant datasets from image recognition databases and protein multiple sequence alignments and discuss possible interpretations for the estimated dimension in light of the correlations among input variables and of the information content of the dataset.
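
The first step named in the abstract, approximating geodesics by minimum-distance paths on a graph, can be illustrated with a minimal sketch. The abstract does not say how the graph is built, so the k-nearest-neighbour construction, the function names (`knn_graph`, `graph_distances`, `pairwise_graph_distances`) and the toy circle dataset below are all assumptions made for illustration, not the authors' implementation. The final step (fitting the distance distribution around its maximum to obtain the ID) is not reproduced here because the abstract does not give its formula.

```python
import heapq
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph (an assumed construction).

    Returns a dict: node -> {neighbour: Euclidean edge length}.
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        by_distance = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in by_distance[:k]:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph undirected
    return graph


def graph_distances(graph, source):
    """Dijkstra shortest-path lengths from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def pairwise_graph_distances(points, k):
    """Graph (approximate geodesic) distance for every connected pair i < j."""
    graph = knn_graph(points, k)
    n = len(points)
    out = []
    for i in range(n):
        dist = graph_distances(graph, i)
        out.extend(dist[j] for j in range(i + 1, n) if j in dist)
    return out


# Toy data (hypothetical): 40 evenly spaced points on a unit circle,
# i.e. a one-dimensional manifold embedded in two dimensions.
n = 40
circle = [
    (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
    for i in range(n)
]
distances = pairwise_graph_distances(circle, k=2)
```

On this toy circle the graph distance between opposite points follows the ring and comes out close to the arc length pi, whereas their straight-line Euclidean distance is 2. The list `distances` is the sample of pairwise graph distances whose distribution the abstract proposes to analyze.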