A Stratified-Subspace Weighted-Tree Random Forest Algorithm Based on Spark

Niu Zhihua (牛志华); Qu Jingyi (屈景怡); Wu Renbiao (吴仁彪)

Abstract

In high-dimensional data, many features are only weakly correlated with the class label, which lowers the classification accuracy of random forests. To address the classification problem of the original random forest algorithm on high-dimensional data, this paper proposes a random forest algorithm that uses stratified subspaces and weighted trees. Because the traditional single-machine mode cannot meet the computational-efficiency demands of high-dimensional data, the proposed algorithm is implemented on Spark, an open-source cluster-computing framework, to exploit its strengths in in-memory caching and iterative computation. The algorithm generates feature subspaces by stratified sampling performed per decision tree, which improves the performance of individual trees while preserving diversity among them. It also adopts a weighted-tree ensemble strategy, so that trees with stronger classification ability have more influence on the combined prediction. Experiments on the MNIST and Gisette datasets show that, compared with the original random forest algorithm, the TWRF algorithm, and the stratified-subspace random forest algorithm, the proposed algorithm achieves higher accuracy, better generalization-error performance, and good scalability, and can classify high-dimensional data effectively.
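The two ideas named in the abstract, per-tree stratified sampling of the feature subspace and weighted-tree voting, can be illustrated with a minimal single-machine sketch. The abstract does not say how the strata are formed or how tree weights are computed, so everything specific here is an assumption: features are split into a "strong" and a "weak" stratum at the mean of a hypothetical relevance score, the subspace is allocated proportionally with at least one strong feature, and each tree's weight is taken to be some accuracy estimate supplied by the caller. The function names `stratified_subspace` and `weighted_vote` are illustrative, and the sketch is not the paper's Spark implementation.

```python
import random
from collections import defaultdict


def stratified_subspace(scores, subspace_size, rng):
    """Pick feature indices for one tree from a strong and a weak stratum.

    Assumption: `scores` holds one relevance score per feature, and the
    strata are split at the mean score. The paper may use a different rule.
    """
    mean_score = sum(scores) / len(scores)
    strong = [i for i, s in enumerate(scores) if s >= mean_score]
    weak = [i for i, s in enumerate(scores) if s < mean_score]

    # Proportional allocation, but always keep at least one strong feature
    # so that no tree is built from uninformative features only.
    n_strong = max(1, round(subspace_size * len(strong) / len(scores)))
    n_strong = min(n_strong, len(strong), subspace_size)
    n_weak = min(subspace_size - n_strong, len(weak))

    return sorted(rng.sample(strong, n_strong) + rng.sample(weak, n_weak))


def weighted_vote(predictions, weights):
    """Combine per-tree class predictions, each vote scaled by its tree's weight.

    Assumption: `weights` are per-tree accuracy estimates (for example,
    out-of-bag accuracy); the abstract does not specify the weighting rule.
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical relevance scores for six features: two strong, four weak.
scores = [0.9, 0.8, 0.1, 0.05, 0.02, 0.01]
subspace = stratified_subspace(scores, subspace_size=3, rng=random.Random(0))

# One accurate tree outvotes two weaker trees that agree with each other.
winner = weighted_vote(["a", "b", "b"], [0.9, 0.3, 0.4])
print(subspace, winner)
```

Under these assumptions, every tree sees at least one informative feature while different trees still draw different subspaces, and the final prediction for the example is `a` even though a plain majority vote would return `b`.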

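Bibliographic record

Source
《信号处理》 (Signal Processing) | 2017, No. 10 | pp. 1301-1307 | 7 pages
Authors
Niu Zhihua (牛志华); Qu Jingyi (屈景怡); Wu Renbiao (吴仁彪)
Affiliation (all three authors)
Tianjin Key Laboratory of Intelligent Signal and Image Processing, Civil Aviation University of China, Tianjin 300300
Original format: PDF
Language of text: Chinese
CLC classification: Algorithm theory
Keywords
High-dimensional data; random forest algorithm; decision tree; stratified sampling; weighted tree; Spark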

