Journal: Machine Learning

A greedy feature selection algorithm for Big Data of high dimensionality


Abstract

We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data of high dimensionality. PFBP partitions the data matrix by both rows and columns. Using p-values of conditional independence tests together with meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, which allows the computations to be massively parallelized. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions: Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy FS algorithms. An application to simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
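The idea of combining partition-local p-values through meta-analysis can be illustrated with a small sketch. The abstract does not say which combination rule PFBP uses, so the code below assumes Fisher's combined probability test, and it assumes the row partitions are disjoint so that the local p-values are independent. The function names `fisher_combine` and `select_best_feature` and the input layout (a dictionary from feature name to a list of local p-values) are hypothetical and chosen only for illustration.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumed choice).

    Under the null, X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom, whose survival function has the closed form
    exp(-h) * sum_{i<k} h**i / i!  with h = X / 2.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    # Clamp to avoid log(0) for p-values reported as exactly zero.
    h = -sum(math.log(max(p, 1e-300)) for p in p_values)
    if h <= 0.0:
        return 1.0  # every local p-value was 1
    # Evaluate the sum in log space so many partitions do not overflow.
    log_terms = [-h + i * math.log(h) - math.lgamma(i + 1) for i in range(k)]
    m = max(log_terms)
    combined = math.exp(m) * sum(math.exp(t - m) for t in log_terms)
    return min(1.0, combined)


def select_best_feature(local_p_values, alpha=0.05):
    """Pick the feature with the smallest combined p-value below alpha.

    local_p_values maps a feature name to the p-values computed
    independently on each row partition (hypothetical layout).
    Returns None when no feature is significant at level alpha.
    """
    combined = {f: fisher_combine(ps) for f, ps in local_p_values.items()}
    best = min(combined, key=combined.get)
    return best if combined[best] < alpha else None


if __name__ == "__main__":
    local = {"snp_1": [0.05, 0.05], "snp_2": [0.5, 0.6]}
    print(select_best_feature(local))
```

Only the local p-values have to leave each partition in this sketch, which is the property that keeps communication low; the conditional independence tests themselves are not shown.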
