A Clustering Algorithm for Microblog Hot Topic Detection Based on Hadoop

Peng Yuqing; Gao Hongcan; Zhang Yuanyuan; Dong Liang

Abstract

To address the difficulty of detecting hot topics quickly and accurately in massive volumes of microblog data, this paper proposes a text clustering algorithm for microblog hot topic detection built on Hadoop big data processing technology. Using Mahout, the open-source machine learning library that runs on the Hadoop platform, the method combines text clustering with hot topic detection and modifies the K-means algorithm that uses a cosine distance measure. Cosine distances that fall in different interval ranges are appropriately increased or decreased, which improves the intra-cluster cohesion and inter-cluster separation of the microblog hot topic clustering results. Experimental results show that, compared with the traditional K-means algorithm, the improved K-means algorithm with the modified cosine distance reduces the intra-cluster distance by 2.72% and increases the inter-cluster distance by 4.12%, and raises recall and accuracy by 7% and 6%, respectively, effectively improving the clustering quality of microblog hot topic detection.

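Bibliographic details

Source
《软件》 (Software) | 2016, Issue 10 | pp. 46-50 | 5 pages
Authors
Peng Yuqing; Gao Hongcan; Zhang Yuanyuan; Dong Liang
Affiliation
School of Computer Science and Software, Hebei University of Technology, Tianjin 300401 (all four authors)
Original format: PDF
Language: Chinese
CLC classification: Theory, methods
Keywords
topic detection; K-means clustering algorithm; intra-cluster distance; inter-cluster distance; Hadoop; Mahout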
