International Conference on Data Science and Advanced Analytics

Projecting 'Better Than Randomly': How to Reduce the Dimensionality of Very Large Datasets in a Way That Outperforms Random Projections



Abstract

For very large datasets, random projections (RP) have become the tool of choice for dimensionality reduction, largely because of the computational complexity of exact principal component analysis. However, the recent development of randomized principal component analysis (RPCA) has opened up the possibility of obtaining approximate principal components on very large datasets. In this paper, we compare the performance of RPCA and RP for dimensionality reduction in supervised learning. In Experiment 1, we study a malware classification task on a dataset with over 10 million samples, almost 100,000 features, and over 25 billion non-zero values, with the goal of reducing the dimensionality to a compressed representation of 5,000 features. To apply RPCA to this dataset, we develop a new algorithm, large sample RPCA (LS-RPCA), which extends RPCA to datasets with arbitrarily many samples. We find that classification performance is much higher when using LS-RPCA for dimensionality reduction than when using random projections; in particular, across a range of target dimensionalities, LS-RPCA reduces classification error by 37% to 54%. Experiment 2 generalizes this phenomenon to multiple datasets, feature representations, and classifiers. These findings have implications for the many research projects in which random projections were used as a preprocessing step for dimensionality reduction. Whenever accuracy is at a premium and the target dimensionality is sufficiently below the numerical rank of the dataset, randomized PCA may be the superior choice; moreover, if the dataset has a large number of samples, LS-RPCA provides a practical way to obtain the approximate principal components.
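The abstract does not spell out the LS-RPCA algorithm itself, so the following NumPy sketch is only one plausible reading of a "large sample" randomized PCA: a two-pass, block-streaming randomized eigendecomposition of the covariance that keeps memory at O(d·k) and never materializes an n×k matrix, in the spirit of Halko-style randomized methods. The function name ls_rpca_sketch and the parameters row_blocks and oversample are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

def ls_rpca_sketch(row_blocks, n_components, oversample=10, seed=0):
    """Approximate top principal directions of A = vstack(row_blocks).

    Hypothetical block-streaming variant of randomized PCA: streams over
    row blocks of the (n x d) data matrix so memory stays O(d * k),
    independent of the number of samples n.
    """
    rng = np.random.default_rng(seed)
    k = n_components + oversample
    d = row_blocks[0].shape[1]
    omega = rng.standard_normal((d, k))

    # Pass 1: accumulate S = A^T A Omega (d x k), a randomized sketch
    # of the dominant eigenspace of the covariance A^T A.
    S = np.zeros((d, k))
    for A_i in row_blocks:
        S += A_i.T @ (A_i @ omega)
    Q, _ = np.linalg.qr(S)  # orthonormal basis for the sketch, d x k

    # Pass 2: project the covariance onto the basis, accumulating the
    # small k x k matrix M = Q^T A^T A Q block by block.
    M = np.zeros((k, k))
    for A_i in row_blocks:
        B_i = A_i @ Q
        M += B_i.T @ B_i

    # Eigendecompose the small matrix and map back through Q to get
    # approximate principal directions (d x n_components).
    evals, evecs = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1][:n_components]
    return Q @ evecs[:, order]

# Toy usage: 4 blocks of 1,000 samples with 200 features each.
rng = np.random.default_rng(1)
blocks = [rng.standard_normal((1000, 200)) for _ in range(4)]
V = ls_rpca_sketch(blocks, n_components=20)
reduced = blocks[0] @ V  # project one block into 20 dimensions
```

On real data at the paper's scale, the blocks would be sparse matrices (the dataset has 25 billion non-zeros over 10 million samples), and the same two passes work unchanged since only matrix products against the blocks are needed.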