On the Nyström and Column-Sampling Methods for the Approximate Principal Components Analysis of Large Datasets


Abstract

In this article, we analyze approximate methods for undertaking a principal components analysis (PCA) on large datasets. PCA is a classical dimension reduction method that involves the projection of the data onto the subspace spanned by the leading eigenvectors of the covariance matrix. This projection can be used either for exploratory purposes or as an input for further analysis, for example, regression. If the data have billions of entries or more, the computational and storage requirements for saving and manipulating the design matrix in fast memory are prohibitive. Recently, the Nyström and column-sampling methods have appeared in the numerical linear algebra community for the randomized approximation of the singular value decomposition of large matrices. However, their utility for statistical applications remains unclear. We compare these approximations theoretically by bounding the distance between the induced subspaces and the desired, but computationally infeasible, PCA subspace. Additionally, we show empirically, through simulations and a real data example involving a corpus of emails, the trade-off between approximation accuracy and computational complexity.
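The two approximations named in the abstract can be illustrated with a minimal NumPy sketch (not the authors' code). Both work from l sampled columns of an n x n positive semidefinite covariance/Gram matrix K: the Nyström method extends the eigendecomposition of the small l x l intersection block, while column sampling takes the left singular vectors of the sampled columns directly. The scalings below follow the standard Nyström extension; in a genuinely large-scale setting one would compute the sampled columns on the fly rather than form K, which is done here only for illustration.

```python
import numpy as np

def nystrom_eig(K, l, k, rng):
    """Nystrom approximation of the top-k eigenpairs of an n x n PSD matrix K.

    Only l sampled columns of K are needed, so K never has to be held
    in memory in full (here it is, purely for illustration).
    """
    n = K.shape[0]
    idx = rng.choice(n, size=l, replace=False)
    C = K[:, idx]                     # n x l block of sampled columns
    W = C[idx, :]                     # l x l intersection block
    lam, U = np.linalg.eigh(W)        # eigenvalues of W, ascending
    lam, U = lam[::-1][:k], U[:, ::-1][:, :k]   # keep top k
    evals = (n / l) * lam             # rescaled eigenvalue estimates
    evecs = np.sqrt(l / n) * (C @ U) / lam      # Nystrom extension of eigenvectors
    return evals, evecs

def column_sampling_eig(K, l, k, rng):
    """Column-sampling approximation: the top-k left singular vectors of
    the sampled columns estimate the PCA subspace."""
    n = K.shape[0]
    idx = rng.choice(n, size=l, replace=False)
    C = K[:, idx]
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    return s[:k], U[:, :k]            # singular values left unscaled here
```

When rank(K) equals the rank of the intersection block W, the Nyström reconstruction is exact; more generally, the quality of both subspace estimates degrades as l shrinks, which is the accuracy/cost trade-off the article studies.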
