ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008)
Unsupervised Feature Selection for Principal Components Analysis

Abstract

Principal Components Analysis (PCA) is the predominant linear dimensionality reduction technique and has been widely applied to datasets in all scientific domains. We consider, both theoretically and empirically, the topic of unsupervised feature selection for PCA by leveraging algorithms for the so-called Column Subset Selection Problem (CSSP). In words, the CSSP seeks the "best" subset of exactly k columns from an m×n data matrix A, and it has been extensively studied in the Numerical Linear Algebra community. We present a novel two-stage algorithm for the CSSP. From a theoretical perspective, for small to moderate values of k, this algorithm significantly improves upon the best previously existing results [24, 12] for the CSSP. From an empirical perspective, we evaluate this algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis: finance, document-term data, and genetics. We pay particular attention to how this algorithm may be used to select representative or landmark features from an object-feature matrix in an unsupervised manner. In all three application domains, we are able to identify k landmark features, i.e., columns of the data matrix, that capture nearly the same amount of information as does the subspace spanned by the top k "eigenfeatures."
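To make the comparison between k selected columns and the top k "eigenfeatures" concrete, the following is a minimal sketch in Python with NumPy. It is not the two-stage algorithm of the paper, which the abstract does not specify. It assumes a simpler stand-in heuristic: keep the k columns with the largest rank-k leverage scores, computed from the top k right singular vectors. The function names, the synthetic matrix, and the Frobenius-norm error measure are all illustrative assumptions.

```python
import numpy as np


def select_columns_by_leverage(A, k):
    """Return indices of the k columns of A with the largest rank-k leverage scores.

    Assumption: a simple deterministic heuristic used for illustration,
    not the two-stage CSSP algorithm described in the paper.
    """
    # Thin SVD; the rows of Vt are the right singular vectors of A.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # Leverage score of column j: squared norm of column j of the top-k block of Vt.
    scores = np.sum(Vt[:k, :] ** 2, axis=0)
    # Keep the k highest-scoring columns, reported in ascending index order.
    return np.sort(np.argsort(scores)[-k:])


def residual_norms(A, cols, k):
    """Frobenius error of projecting A onto the chosen columns, and onto the top-k PCA subspace."""
    C = A[:, cols]
    # Projection of A onto the span of the selected columns: C C^+ A.
    err_cols = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, "fro")
    # Best rank-k approximation A_k from the truncated SVD.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    err_pca = np.linalg.norm(A - A_k, "fro")
    return err_cols, err_pca


# Synthetic 50 x 20 object-feature matrix that is close to rank 3 (illustrative data only).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))
A += 0.01 * rng.standard_normal((50, 20))

k = 3
cols = select_columns_by_leverage(A, k)
err_cols, err_pca = residual_norms(A, cols, k)
print("selected columns:", cols)
print("column-subset error:", err_cols, "rank-k PCA error:", err_pca)
```

The projection onto k actual columns has rank at most k, so its error can never fall below the rank-k PCA error. The gap between the two printed values is the quantity a CSSP algorithm tries to keep small.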
