To address the excessive share of running time consumed by communication under the MapReduce mechanism, which limits the practical value of such algorithms, we propose a Hadoop-based two-stage parallel c-Means clustering algorithm for classifying very large data sets. First, we improve the MPI communication management method used under the MapReduce mechanism, adopting a membership management protocol to synchronise member management with the MapReduce reduce operation. Second, we perform the reduce operation over groups of typical individuals instead of over all individuals globally, and define a two-stage buffer algorithm. Finally, the first-stage buffer further reduces the amount of data handled by the second-stage MapReduce operation, limiting as far as possible the negative impact of big data on the algorithm. On this basis, we ran simulations on an artificial big-data test set and on the KDD CUP 99 intrusion test set. The experimental results show that the algorithm meets the clustering precision requirement while effectively speeding up the algorithm's operation.
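The abstract gives no pseudocode, so the following is only a minimal single-machine sketch of the two-stage idea under stated assumptions. It assumes hard (crisp) c-Means, treats each data partition as one map task, and takes the "typical individuals" to be weighted local centroids; none of these choices is confirmed by the source. The names `weighted_c_means`, `stage_one_buffer` and `two_stage_c_means`, and the parameter `m` (representatives per partition), are hypothetical. Hadoop, MPI and the membership management protocol are not modelled.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _nearest(p: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((x - y) ** 2 for x, y in zip(p, centers[k])))


def weighted_c_means(points: Sequence[Point], weights: Sequence[float],
                     c: int, iters: int = 20, seed: int = 0) -> List[Point]:
    """Hard c-Means (Lloyd iterations) in which each point carries a weight."""
    rng = random.Random(seed)
    centers = rng.sample(list(points), c)  # initial centers drawn from the data
    dim = len(points[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(c)]
        totals = [0.0] * c
        for p, w in zip(points, weights):
            j = _nearest(p, centers)
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # An empty cluster keeps its previous center.
        centers = [tuple(s / totals[j] for s in sums[j]) if totals[j] > 0
                   else centers[j] for j in range(c)]
    return centers


def stage_one_buffer(partition: Sequence[Point], m: int,
                     seed: int = 0) -> List[Tuple[Point, float]]:
    """Stage 1 (assumed): condense one partition into at most m
    representatives, each weighted by the number of points it stands for."""
    centers = weighted_c_means(partition, [1.0] * len(partition), m, seed=seed)
    counts = [0.0] * m
    for p in partition:
        counts[_nearest(p, centers)] += 1.0
    return [(ctr, n) for ctr, n in zip(centers, counts) if n > 0]


def two_stage_c_means(partitions: Sequence[Sequence[Point]],
                      c: int, m: int) -> List[Point]:
    """Stage 2 (assumed): cluster only the weighted representatives,
    so the global step sees far fewer records than the raw data."""
    reps: List[Tuple[Point, float]] = []
    for i, part in enumerate(partitions):
        reps.extend(stage_one_buffer(part, m, seed=i))
    return weighted_c_means([r[0] for r in reps], [r[1] for r in reps], c)
```

Calling `two_stage_c_means(partitions, c=2, m=4)` on four partitions passes at most 16 weighted records to the global step, whatever the partition sizes. In this sketch that reduction stands in for the smaller second-stage data volume the abstract describes.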