Traditional data mining approaches suffer from weak computing power, low efficiency, and poor scalability when processing massive, multi-dimensional, and complex data. This paper proposes a decision tree classification mining method based on Map/Reduce (the C4.5BH algorithm). The algorithm uses K-means clustering to discretize continuous attributes, and uses the Map/Reduce programming model together with an attribute table structure to compute attributes in parallel and split nodes in parallel during decision tree construction. Experiments show that, compared with the traditional C4.5 algorithm, C4.5BH achieves higher execution efficiency and good speedup on large-scale data sets.
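The two ingredients named in the abstract, K-means discretization of a continuous attribute and a map/reduce pass that gathers the counts needed to score attributes, can be illustrated with a minimal single-process sketch. This is not the paper's C4.5BH implementation: the abstract does not describe the attribute table layout or the actual Hadoop jobs, so every function name below (`kmeans_1d`, `discretize`, `map_record`, `reduce_counts`, `gain_ratios`) and the toy data are assumptions made for illustration. The map and reduce steps are simulated with plain Python functions, and the score is the standard C4.5 gain ratio.

```python
import math
from collections import Counter, defaultdict
from itertools import chain


def kmeans_1d(values, k, iters=20):
    """Cluster scalar values with Lloyd's algorithm; return sorted centroids."""
    ordered = sorted(values)
    # Deterministic start: pick initial centroids spread over the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        buckets = defaultdict(list)
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            buckets[nearest].append(v)
        # Keep the old centroid if a cluster received no points.
        centroids = [sum(buckets[c]) / len(buckets[c]) if buckets[c] else centroids[c]
                     for c in range(k)]
    return sorted(centroids)


def discretize(value, centroids):
    """Replace a continuous value by the index of its nearest centroid."""
    return min(range(len(centroids)), key=lambda c: abs(value - centroids[c]))


def map_record(record, label):
    """Map step: emit ((attribute, value, label), 1) for each attribute of a record."""
    return [((attr, val, label), 1) for attr, val in record.items()]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each (attribute, value, label) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def entropy(counts):
    """Shannon entropy (base 2) of a collection of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def gain_ratios(totals, label_counts):
    """Compute the C4.5 gain ratio of every attribute from the reduced counts."""
    n = sum(label_counts.values())
    base = entropy(label_counts.values())
    per_attr = defaultdict(lambda: defaultdict(Counter))
    for (attr, val, label), count in totals.items():
        per_attr[attr][val][label] += count
    ratios = {}
    for attr, by_value in per_attr.items():
        sizes = [sum(c.values()) for c in by_value.values()]
        cond = sum(s / n * entropy(c.values()) for s, c in zip(sizes, by_value.values()))
        split_info = entropy(sizes)
        ratios[attr] = (base - cond) / split_info if split_info > 0 else 0.0
    return ratios


# Toy data (hypothetical): one continuous attribute and one categorical attribute.
temps = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
windy = ["a", "b", "a", "b", "a", "b"]
labels = ["no", "no", "no", "yes", "yes", "yes"]

centroids = kmeans_1d(temps, k=2)
records = [{"temp": discretize(t, centroids), "windy": w} for t, w in zip(temps, windy)]
pairs = chain.from_iterable(map_record(r, y) for r, y in zip(records, labels))
totals = reduce_counts(pairs)
ratios = gain_ratios(totals, Counter(labels))
best = max(ratios, key=ratios.get)  # attribute chosen to split the current node
```

In a real Map/Reduce deployment the `map_record` calls would run on separate data partitions and the framework would group the keys before `reduce_counts`; because the counts are additive, each attribute can be scored independently, which is what makes the per-attribute computation parallelizable.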