Journal: Machine Learning

Massively parallel feature selection: an approach based on variance preservation

Abstract

Advances in computer technologies have enabled corporations to accumulate data at an unprecedented speed. Large-scale business data might contain billions of observations and thousands of features, which easily brings their scale to the level of terabytes. Most traditional feature selection algorithms are designed and implemented for a centralized computing architecture. Their usability significantly deteriorates when data size exceeds tens of gigabytes. High-performance distributed computing frameworks and protocols, such as the Message Passing Interface (MPI) and MapReduce, have been proposed to facilitate software development on grid infrastructures, enabling analysts to process large-scale problems efficiently. This paper presents a novel large-scale feature selection algorithm that is based on variance analysis. The algorithm selects features by evaluating their abilities to explain data variance. It supports both supervised and unsupervised feature selection and can be readily implemented in most distributed computing environments. The algorithm was implemented as a SAS High-Performance Analytics procedure, which can read data in distributed form and perform parallel feature selection in both symmetric multiprocessing (SMP) mode and massively parallel processing (MPP) mode. Experimental results demonstrated the superior performance of the proposed method for large-scale feature selection.
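
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one way to select features by how much data variance they explain, in the unsupervised case. It is not the paper's algorithm or the SAS procedure. The function names `chunked_covariance` and `greedy_variance_selection`, the greedy forward strategy, and the toy data are all assumptions made for illustration. Row chunks stand in for data held on separate workers: each chunk contributes only a row count, a column-sum vector, and a Gram matrix, which is the kind of additive statistic an MPI reduction or a MapReduce job could combine.

```python
import numpy as np


def chunked_covariance(chunks):
    """Combine per-chunk sufficient statistics into a sample covariance matrix.

    Each chunk is a (rows x features) block; in a distributed setting each
    worker would compute its own count, column sums, and Gram matrix.
    """
    n, total, gram = 0, None, None
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=float)
        if total is None:
            d = chunk.shape[1]
            total, gram = np.zeros(d), np.zeros((d, d))
        n += chunk.shape[0]
        total += chunk.sum(axis=0)
        gram += chunk.T @ chunk
    return (gram - np.outer(total, total) / n) / (n - 1)


def greedy_variance_selection(cov, k):
    """Greedily pick up to k features that explain the most total variance.

    Assumed strategy (illustrative only): at each step choose the feature
    whose linear fit removes the most residual variance across all features,
    then deflate the residual covariance by that feature.
    """
    residual = np.array(cov, dtype=float)
    total_var = np.trace(residual)
    tol = 1e-10 * np.max(np.diag(residual))
    selected = []
    for _ in range(k):
        diag = np.diag(residual).copy()
        usable = diag > tol
        if selected:
            usable[selected] = False
        if not usable.any():
            break  # no remaining feature carries residual variance
        scores = np.full(diag.shape, -np.inf)
        # Variance removed by regressing every feature on candidate j.
        scores[usable] = (residual[:, usable] ** 2).sum(axis=0) / diag[usable]
        j = int(np.argmax(scores))
        selected.append(j)
        residual -= np.outer(residual[:, j], residual[j, :]) / residual[j, j]
    explained = 1.0 - np.trace(residual) / total_var
    return selected, explained


# Toy data (hypothetical): five features, two of them exact linear
# combinations of the first three.
rng = np.random.default_rng(0)
base = rng.normal(size=(600, 3))
X = np.column_stack([
    base[:, 0], base[:, 1], base[:, 2],
    base[:, 0] + base[:, 1],
    base[:, 1] - base[:, 2],
])

cov = chunked_covariance(np.array_split(X, 4))  # four simulated workers
selected, explained = greedy_variance_selection(cov, k=3)
print(selected, round(explained, 4))
```

Because the selection step works only on the feature-by-feature covariance matrix, its cost depends on the number of features rather than the number of observations, which is one plausible reason a variance-based criterion suits distributed data. The supervised case mentioned in the abstract is not covered by this sketch.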