Journal: 《软件》

A Clustering Algorithm for Microblog Hot Topic Detection Based on Hadoop

         

Abstract

Hot topics are difficult to detect rapidly and accurately in massive volumes of microblog data. To address this problem, the paper proposes a text clustering algorithm for microblog hot topic detection built on Hadoop big-data processing technology. Using Mahout, the open-source machine learning library that runs on the Hadoop platform, the work combines text clustering with hot topic detection and modifies the K-means algorithm that uses the cosine distance measure: cosine distances that fall in different interval ranges are increased or decreased as appropriate, which improves both the intra-cluster aggregation and the inter-cluster separation of the hot topic clustering result. Experimental results show that, compared with the traditional K-means algorithm, the improved K-means with the modified cosine distance reduces the intra-cluster distance by 2.72% and increases the inter-cluster distance by 4.12%, while recall and accuracy rise by 7% and 6% respectively, effectively improving the clustering quality of microblog hot topic detection.
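The abstract does not state the interval bounds or the scaling factors, so the following is only a minimal Python sketch of one plausible reading of the idea, not the paper's Mahout implementation. It assumes that small cosine distances are shrunk and large ones are stretched; the constants `LOW`, `HIGH`, `SHRINK` and `STRETCH`, and the function names `adjusted_cosine_distance` and `kmeans`, are hypothetical and chosen purely for illustration.

```python
import math


def cosine_distance(a, b):
    """Standard cosine distance, 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    # Clamp at 0 to absorb floating-point rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


# Hypothetical values: the abstract gives neither the intervals nor the factors.
LOW, HIGH = 0.3, 0.7         # assumed interval boundaries
SHRINK, STRETCH = 0.8, 1.2   # assumed scaling factors


def adjusted_cosine_distance(a, b):
    """Shrink small distances and stretch large ones (assumed direction)."""
    d = cosine_distance(a, b)
    if d < LOW:
        return d * SHRINK
    if d > HIGH:
        return min(1.0, d * STRETCH)
    return d


def kmeans(points, init_centroids, distance, iterations=10):
    """Plain K-means with a pluggable distance function."""
    centroids = [list(c) for c in init_centroids]  # do not mutate the caller's list
    k = len(centroids)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy term-weight vectors standing in for vectorized microblog posts.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centers = kmeans(docs, [docs[0], docs[2]], adjusted_cosine_distance)
```

Passing the distance as a parameter loosely mirrors how a clustering library lets the distance measure be swapped without touching the clustering loop, so the adjusted measure can be compared against plain `cosine_distance` on the same data.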
