With the rapid growth of microblog data, extracting hot topics from large volumes of microblog posts has become an active research area. Because traditional hot term extraction methods are poorly suited to microblog data, a term life value calculation model based on aging theory is proposed to extract hot terms, and a term co-occurrence network is then built from the correlations between hot terms. Traditional term clustering algorithms handle hot terms shared between topics poorly and are inefficient on large data sets. To address both problems, the idea of multi-label propagation is introduced, and a term clustering algorithm based on multi-label propagation (TCMLPA), with nearly linear time complexity, is designed to cluster the hot terms in the co-occurrence network and obtain the set of hot topics. Experimental results show that the life value calculation model filters noise and extracts hot terms effectively, and that TCMLPA improves the accuracy and efficiency of hot topic detection while keeping the clustering results stable.
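The abstract does not say how the correlation between hot terms is measured, so the following is only a minimal sketch of the co-occurrence network step. It assumes that the edge weight is the number of posts in which two hot terms appear together. The function name `build_cooccurrence_network`, the `min_weight` threshold, and the sample posts are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(posts, hot_terms, min_weight=1):
    """Return {(term_a, term_b): weight} for hot terms that share a post.

    Assumption: weight = number of posts containing both terms. The paper's
    actual correlation measure is not given in the abstract.
    """
    hot = set(hot_terms)
    weights = Counter()
    for tokens in posts:
        # Count each hot term once per post; sorting gives every edge a
        # single canonical (a, b) key with a < b.
        present = sorted(hot.intersection(tokens))
        for a, b in combinations(present, 2):
            weights[(a, b)] += 1
    # Drop weak edges below the (hypothetical) threshold.
    return {edge: w for edge, w in weights.items() if w >= min_weight}


# Hypothetical tokenized posts, for illustration only.
posts = [
    ["earthquake", "rescue", "today"],
    ["earthquake", "rescue", "donation"],
    ["concert", "tickets", "today"],
]
hot_terms = ["earthquake", "rescue", "donation", "concert", "tickets"]
network = build_cooccurrence_network(posts, hot_terms)
strong = build_cooccurrence_network(posts, hot_terms, min_weight=2)
```

A clustering step such as the multi-label propagation described above would then run on `network`. Representing the graph as a weighted edge dictionary keeps the sketch independent of any graph library.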