提出一种在大规模微博客短文本数据集上发现新闻话题的方法.利用隐主题分析技术,解决短文本相似度度量的问题.在每个时间窗口内,根据新闻的特点选取出最有可能谈论新闻事件的微博客文本,然后用两层的K均值和层次聚类的混合聚类方法,对这个时间窗口内的那些最有可能谈论新闻事件的微博文本进行聚类,从而检测出新闻话题.此方法能较好地解决微博客短文本的数据稀疏性及数据量巨大的问题.实验证明该算法的有效性.%A method of news topics extraction from large-scale short posts of microblogging-service is proposed. Through the hidden topic analysis, the similarity measurement of short texts is solved well. In every time window, the short posts which are most likely to talk about news events are selected according to the characteristics of the news. Then, a two-level K-means-hierarchical hybrid clustering method is used to cluster all the selected data into different news topics. The experimental results show the proposed method works well on large-scale microblog dataset.
展开▼