Journal: Expert Systems with Applications

Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling


Abstract

Due to its simplicity and intuitive interpretability, spherical k-means is often used for clustering a large number of documents. However, it has a number of drawbacks that need to be addressed for more effective document clustering. Without well-dispersed initial points, spherical k-means converges slowly, and fast convergence is critical when clustering a large number of documents. Furthermore, its dense centroid vectors needlessly incorporate the impact of infrequent and less-informative words, thereby distorting the distance calculation between the document vectors.

In this paper, we propose practical improvements on spherical k-means to overcome these issues during document clustering. Our proposed initialization method not only guarantees dispersed initial points, but is also up to 1000 times faster than well-known initialization methods such as k-means++. Furthermore, we enforce sparsity on the centroid vectors by using a data-driven threshold that dynamically adjusts its value depending on the clusters. Additionally, we propose an unsupervised cluster labeling method that effectively extracts meaningful keywords to describe each cluster.

We have tested our improvements on seven different text datasets, including both new and publicly available datasets. Based on our experiments on these datasets, we have found that our proposed improvements successfully overcome the drawbacks of spherical k-means with significantly reduced computation time. Furthermore, we have qualitatively verified the performance of the proposed cluster labeling method by extracting descriptive keywords of the clusters from these datasets.

(c) 2020 Published by Elsevier Ltd.
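For readers unfamiliar with the baseline, the following is a minimal sketch of spherical k-means with a sparsifying step on the centroids. It is not the authors' implementation: the abstract does not specify their initializer or their data-driven threshold, so this sketch assumes plain random initialization and a hypothetical per-cluster rule (`min_ratio`, a made-up parameter) that drops weights smaller than a fraction of the largest weight in the same centroid. It also assumes nonnegative document vectors such as TF-IDF rows.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row to unit L2 norm; all-zero rows are left as zeros."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def sparsify(centroids, min_ratio=0.1):
    """Zero out weak centroid weights (hypothetical rule, not the paper's threshold).

    A weight is kept only if it is at least `min_ratio` times the largest
    weight in its own centroid, so the cutoff differs from cluster to cluster.
    """
    cutoff = min_ratio * centroids.max(axis=1, keepdims=True)
    return np.where(centroids >= cutoff, centroids, 0.0)

def spherical_kmeans(X, k, n_iter=20, min_ratio=0.1, seed=0):
    """Cluster nonnegative row vectors by cosine similarity."""
    rng = np.random.default_rng(seed)
    X = l2_normalize(np.asarray(X, dtype=float))
    # Plain random initialization (assumption); the paper's fast
    # initializer is not reproduced here.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # On unit vectors, cosine similarity is just the dot product.
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.sum(axis=0)
        # Project centroids to sparse vectors, then back onto the unit sphere.
        centroids = l2_normalize(sparsify(centroids, min_ratio))
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids
```

Calling `spherical_kmeans(X, k=2)` on a small document-term matrix returns one label per row and a `k`-by-vocabulary centroid matrix whose rows have unit norm. The nonzero entries of each centroid could serve as a crude starting point for keyword inspection, though the paper's labeling method is a separate procedure not shown here.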
