Journal: Cluster Computing
A comparison study of clustering algorithms for microblog posts

Abstract

Clustering is a popular unsupervised learning approach to topic analysis in text mining. In this paper, we present a comparison study of clustering algorithms for microblog posts, including weighting and the programming model. Our experimental data were crawled from Sina Weibo in China and comprise 74,662 microblog posts on 14 topics related to Internet technology. We first preprocess these posts. We then propose a manual-sampling-based dynamic incremental clustering algorithm (MS-DICA) to extract topic threads from the crawled microblogs. We evaluate the proposed algorithm from four aspects and compare it experimentally with the traditional k-means algorithm in terms of accuracy and efficiency. The results show that MS-DICA is effective for topic thread extraction: its accuracy is close to that of traditional k-means, and its running speed improves by more than five times. In addition, the MapReduce programming model on the Hadoop distributed computing platform can run the k-means algorithm in parallel to speed up clustering.
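
To make the last point of the abstract concrete, the following is a minimal sketch of how one k-means iteration decomposes into a map step and a reduce step. It is an illustration only and not the authors' implementation: it runs in a single Python process rather than on Hadoop, the function names (`map_assign`, `reduce_mean`, `kmeans`) are hypothetical, and the tuples stand in for whatever weighted feature vectors the preprocessing produces. MS-DICA itself is not sketched, since the abstract does not describe its steps.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(point: Vector, centroids: List[Vector]) -> Tuple[int, Tuple[Vector, int]]:
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    return nearest, (point, 1)


def reduce_mean(values: List[Tuple[Vector, int]]) -> Vector:
    """Reduce step: average all points that were assigned to one centroid."""
    count = sum(n for _, n in values)
    dims = len(values[0][0])
    sums = [sum(vec[d] for vec, _ in values) for d in range(dims)]
    return tuple(s / count for s in sums)


def kmeans_iteration(points: List[Vector], centroids: List[Vector]) -> List[Vector]:
    """One k-means pass expressed as map -> group by key -> reduce."""
    groups: Dict[int, List[Tuple[Vector, int]]] = defaultdict(list)
    for p in points:  # on a cluster, this loop is what the mappers share out
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [reduce_mean(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


def kmeans(points: List[Vector], centroids: List[Vector], max_iter: int = 20) -> List[Vector]:
    """Repeat iterations until the centroids stop changing or max_iter is hit."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Because each point is assigned independently of the others, the map step can be spread across many workers, and only the per-cluster sums and counts need to be brought together in the reduce step. That independence is what allows k-means to be run in parallel under the MapReduce model.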
