With the exponential growth of Internet users and content, computing the Jaccard similarity coefficient on large-scale data places higher demands on algorithm efficiency. To improve execution efficiency, we analyze the shortcomings of running the algorithm under the MapReduce architecture. Because Spark is well suited to iterative and interactive tasks, we port the algorithm from the MapReduce platform to the Spark platform on the basis of a two-dimensional partition algorithm, and further improve its execution efficiency through parameter tuning, memory optimization, and other methods. Experiments with two data sets on three clusters of different sizes show that, compared with MapReduce, the algorithm on the Spark platform improves execution efficiency by more than 4 times and energy efficiency by more than 3 times.
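For readers unfamiliar with the quantity being computed, the following is a minimal single-machine sketch of the Jaccard similarity coefficient, |A ∩ B| / |A ∪ B|, evaluated over all pairs of records by walking a two-dimensional grid of block pairs. It is an illustration only, not the paper's implementation: the abstract does not describe the partition scheme in detail, so the modulo-based block assignment, the function names `jaccard`, `block_pairs`, and `all_pairs_jaccard`, and the sample data are all assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def block_pairs(num_blocks):
    """Yield block pairs (i, j) with i <= j, the upper triangle of a 2-D grid."""
    for i in range(num_blocks):
        for j in range(i, num_blocks):
            yield i, j


def all_pairs_jaccard(sets, num_blocks=2):
    """Compute Jaccard for every unordered pair of records, one block pair at a time."""
    keys = sorted(sets)
    # Hypothetical assignment: the record at position p goes to block p % num_blocks.
    blocks = [keys[b::num_blocks] for b in range(num_blocks)]
    result = {}
    for i, j in block_pairs(num_blocks):
        if i == j:
            # Pairs inside a single block.
            pairs = combinations(blocks[i], 2)
        else:
            # Pairs that span two different blocks.
            pairs = ((x, y) for x in blocks[i] for y in blocks[j])
        for x, y in pairs:
            result[tuple(sorted((x, y)))] = jaccard(sets[x], sets[y])
    return result


# Made-up sample records, each represented as a set of tokens.
docs = {
    "d1": {"spark", "hadoop", "jaccard"},
    "d2": {"spark", "jaccard"},
    "d3": {"hadoop", "mapreduce"},
}
similarities = all_pairs_jaccard(docs, num_blocks=2)
```

Each block pair is independent of the others, which is what would let a distributed framework hand one block pair to each task; in this sketch they are simply processed in a loop.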