《计算机应用》 (Journal of Computer Applications)

BTopicMiner: A Domain-Specific Hot Topic Mining System for Chinese Microblogs

         

Abstract

With the rapid growth of microblogging, automatically extracting the hot topics that interest users from massive volumes of microblog posts has become a challenging research problem. This paper proposes a hot topic extraction algorithm for Chinese microblogs based on an extended topic model. To address the data sparsity inherent in microblog posts, the algorithm first uses text clustering to merge content-related posts into synthetic documents. Based on the assumption that posting (reply) relationships among microblogs imply topical correlation, the traditional Latent Dirichlet Allocation (LDA) topic model is extended to model these relationships. Finally, mutual information (MI) is used to compute the topic vocabulary of each extracted topic for hot topic recommendation. To verify the effectiveness of the extended model, a prototype system for domain-specific hot topic mining of Chinese microblogs, named BTopicMiner, was implemented. Experimental results show that the extended topic model based on posting relationships extracts hot topics from microblogs more accurately, and that the semantic similarity between the topic vocabulary computed automatically with MI and the hot words selected manually exceeds 75%.
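The abstract does not give the exact MI formulation used to pick topic vocabulary, so the following is only a minimal sketch of one common variant, pointwise mutual information between a word and a topic, computed from token-level topic assignments such as an LDA sampler might produce. The function name `pmi_topic_words`, its parameters, and the toy English tokens are illustrative assumptions, not part of BTopicMiner.

```python
import math
from collections import Counter


def pmi_topic_words(assignments, top_n=3):
    """Rank words per topic by pointwise mutual information (PMI).

    `assignments` is an iterable of (word, topic_id) pairs, one per token.
    PMI(w, k) = log( p(w, k) / (p(w) * p(k)) ), estimated from raw counts.
    This is an assumed scoring rule, not the paper's confirmed formula.
    """
    assignments = list(assignments)
    total = len(assignments)
    if total == 0:
        return {}

    joint = Counter(assignments)                      # n(w, k)
    word_counts = Counter(w for w, _ in assignments)  # n(w)
    topic_counts = Counter(k for _, k in assignments) # n(k)

    scores = {}
    for (word, topic), n_wk in joint.items():
        pmi = math.log((n_wk * total) / (word_counts[word] * topic_counts[topic]))
        scores.setdefault(topic, []).append((word, pmi))

    # Highest PMI first; ties broken alphabetically for a stable order.
    return {
        topic: sorted(words, key=lambda item: (-item[1], item[0]))[:top_n]
        for topic, words in scores.items()
    }


# Toy token-level assignments: topic 0 is finance-like, topic 1 sports-like.
toy = [
    ("stock", 0), ("stock", 0), ("market", 0), ("the", 0),
    ("the", 1), ("goal", 1), ("match", 1), ("the", 1),
]
ranked = pmi_topic_words(toy)
```

In this toy input the word "the" appears under both topics, so it receives a lower score than words that occur under only one topic, which is the behaviour one would want from a topic-vocabulary selector.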
