On the Nyström and Column-Sampling Methods for the Approximate Principal Components Analysis of Large Datasets


Abstract

In this article, we analyze approximate methods for undertaking a principal components analysis (PCA) on large datasets. PCA is a classical dimension reduction method that involves the projection of the data onto the subspace spanned by the leading eigenvectors of the covariance matrix. This projection can be used either for exploratory purposes or as an input for further analysis, for example, regression. If the data have billions of entries or more, the computational and storage requirements for saving and manipulating the design matrix in fast memory are prohibitive. Recently, the Nyström and column-sampling methods have appeared in the numerical linear algebra community for the randomized approximation of the singular value decomposition of large matrices. However, their utility for statistical applications remains unclear. We compare these approximations theoretically by bounding the distance between the induced subspaces and the desired, but computationally infeasible, PCA subspace. Additionally, we show empirically, through simulations and a real data example involving a corpus of emails, the trade-off between approximation accuracy and computational complexity.
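The two approximations named in the abstract can be illustrated with a minimal NumPy sketch (not the authors' code). Both work from l sampled columns of an n x n positive semidefinite covariance/Gram matrix K: the Nyström method extends the eigendecomposition of the small l x l intersection block, while column sampling takes the left singular vectors of the sampled columns directly. The scalings below follow the standard Nyström extension; in a genuinely large-scale setting one would compute the sampled columns on the fly rather than form K, which is done here only for illustration.

```python
import numpy as np

def nystrom_eig(K, l, k, rng):
    """Nystrom approximation of the top-k eigenpairs of an n x n PSD matrix K.

    Only l sampled columns of K are needed, so K never has to be held
    in memory in full (here it is, purely for illustration).
    """
    n = K.shape[0]
    idx = rng.choice(n, size=l, replace=False)
    C = K[:, idx]                     # n x l block of sampled columns
    W = C[idx, :]                     # l x l intersection block
    lam, U = np.linalg.eigh(W)        # eigenvalues of W, ascending
    lam, U = lam[::-1][:k], U[:, ::-1][:, :k]   # keep top k
    evals = (n / l) * lam             # rescaled eigenvalue estimates
    evecs = np.sqrt(l / n) * (C @ U) / lam      # Nystrom extension of eigenvectors
    return evals, evecs

def column_sampling_eig(K, l, k, rng):
    """Column-sampling approximation: the top-k left singular vectors of
    the sampled columns estimate the PCA subspace."""
    n = K.shape[0]
    idx = rng.choice(n, size=l, replace=False)
    C = K[:, idx]
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    return s[:k], U[:, :k]            # singular values left unscaled here
```

When rank(K) equals the rank of the intersection block W, the Nyström reconstruction is exact; more generally, the quality of both subspace estimates degrades as l shrinks, which is the accuracy/cost trade-off the article studies.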
