Microblog posts are numerous, short, and cover a wide range of topics. As a result, microblog data contain many isolated points (outliers), which adversely affect the clustering algorithms used to discover hot topics. We therefore propose a microblog hot topic discovery method based on outlier elimination. First, the outliers are removed from the dataset. Next, the CURE (Clustering Using Representatives) algorithm is used to cluster the remaining data, which are the data worth clustering. Finally, the validity of the algorithm is verified on examples. The results show that, compared with the clustering algorithm used for comparison, the proposed algorithm reduces the sensitivity of the clustering result to outliers, improves the accuracy of hot topic discovery, and runs more efficiently, making it better suited to large-scale microblog hot topic discovery.
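The two-stage pipeline described in the abstract (remove outliers, then cluster with CURE) can be illustrated with a minimal sketch. The abstract does not say how outliers are detected, so the k-nearest-neighbour distance test below is an assumption, as are all function names, parameters, and the toy 2-D points standing in for vectorized posts. The clustering step is a simplified CURE-style procedure (well-scattered representatives shrunk toward the centroid, closest clusters merged), not the authors' implementation.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def remove_outliers(points, k=2, threshold=2.0):
    """Assumed outlier test: drop a point if its k-th nearest
    neighbour is farther away than `threshold`."""
    kept = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        if len(d) >= k and d[k - 1] <= threshold:
            kept.append(p)
    return kept

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(c) / n for c in zip(*cluster))

def representatives(cluster, n_rep=3, shrink=0.3):
    """Pick up to n_rep well-scattered points (farthest-point heuristic),
    then move each a fraction `shrink` of the way toward the centroid."""
    c = centroid(cluster)
    chosen = [max(cluster, key=lambda p: dist(p, c))]
    while len(chosen) < min(n_rep, len(cluster)):
        rest = [p for p in cluster if p not in chosen]
        if not rest:  # only duplicates of already chosen points remain
            break
        chosen.append(max(rest, key=lambda p: min(dist(p, r) for r in chosen)))
    return [tuple(x + shrink * (cx - x) for x, cx in zip(p, c)) for p in chosen]

def cure_like(points, n_clusters, n_rep=3, shrink=0.3):
    """Simplified CURE-style agglomerative clustering: repeatedly merge
    the two clusters whose representative points are closest."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        reps = [representatives(c, n_rep, shrink) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in reps[i] for b in reps[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

if __name__ == "__main__":
    # Toy data: two tight groups and one isolated point.
    posts = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11),
             (50, 50)]
    kept = remove_outliers(posts)
    for cluster in cure_like(kept, n_clusters=2):
        print(cluster)
```

In practice the points would be high-dimensional text vectors rather than 2-D tuples, and the quadratic merge loop here would need the sampling and partitioning that make CURE usable on large datasets.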