Journal: ACM Transactions on Asian Language Information Processing

Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents

Abstract

In this article, I investigated the performance of the bisect K-means clustering algorithm compared to the standard K-means algorithm in the analysis of Arabic documents. The experiments covered five commonly used similarity and distance functions (Pearson correlation coefficient, cosine, Jaccard coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) and three leading stemmers. Measured by purity, the bisect K-means clearly outperformed the standard K-means in all settings, by varying margins. For the bisect K-means, the best purity was 0.927, obtained with the Pearson correlation coefficient function; for the standard K-means, the best purity was 0.884, obtained with the Jaccard coefficient function. Removing stop words significantly improved the results of the bisect K-means but only slightly improved those of the standard K-means. Stemming provided a further minor improvement in all settings except the combination of the averaged Kullback-Leibler divergence function and the root-based stemmer, where purity deteriorated by more than 10%. These experiments were conducted on a dataset with nine categories, each containing 300 documents.
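
To make the two central ideas of the abstract concrete, the following is a minimal Python sketch of bisecting K-means and of the purity measure. It is an illustration under stated assumptions, not the article's implementation: it uses squared Euclidean distance only (one of the five functions named above), always splits the largest cluster, and keeps the lowest-error split out of a few random trials. The function names (`bisecting_kmeans`, `two_means`, `purity`) and the parameters `trials` and `iters` are hypothetical, and the toy 2-D points stand in for document term vectors.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points, members):
    """Mean vector of the points whose indices are in `members`."""
    dims = range(len(points[0]))
    return [sum(points[i][d] for i in members) / len(members) for d in dims]

def sse(points, members):
    """Sum of squared distances from each member to the cluster centroid."""
    c = centroid(points, members)
    return sum(sq_dist(points[i], c) for i in members)

def two_means(points, members, rng, iters=20):
    """Split `members` (at least two indices) into two non-empty groups."""
    c0, c1 = (points[i] for i in rng.sample(members, 2))
    half = len(members) // 2
    groups = (members[:half], members[half:])  # fallback if a side empties
    for _ in range(iters):
        left = [i for i in members
                if sq_dist(points[i], c0) <= sq_dist(points[i], c1)]
        right = [i for i in members
                 if sq_dist(points[i], c0) > sq_dist(points[i], c1)]
        if not left or not right or (left, right) == groups:
            break
        groups = (left, right)
        c0, c1 = centroid(points, left), centroid(points, right)
    return groups

def bisecting_kmeans(points, k, seed=0, trials=5):
    """Return one cluster label per point, using repeated 2-way splits."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    clusters = [list(range(len(points)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()  # assumption: split the largest cluster
        best = None
        for _ in range(trials):  # keep the split with the lowest total SSE
            a, b = two_means(points, target, rng)
            cost = sse(points, a) + sse(points, b)
            if best is None or cost < best[0]:
                best = (cost, a, b)
        clusters += [best[1], best[2]]
    labels = [0] * len(points)
    for cid, members in enumerate(clusters):
        for i in members:
            labels[i] = cid
    return labels

def purity(labels, truth):
    """Fraction of items that belong to the majority class of their cluster."""
    per_cluster = {}
    for label, cls in zip(labels, truth):
        per_cluster.setdefault(label, Counter())[cls] += 1
    return sum(max(c.values()) for c in per_cluster.values()) / len(truth)

# Toy usage: three groups of 2-D points standing in for document vectors.
points = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9],
          [20, 0], [20, 1], [21, 0]]
truth = ["a"] * 3 + ["b"] * 3 + ["c"] * 3
labels = bisecting_kmeans(points, k=3)
score = purity(labels, truth)
```

Purity here follows the usual definition: each cluster is credited with its most frequent true category, and the credits are summed and divided by the number of documents. Swapping `sq_dist` for another of the listed functions (for example a cosine-based distance) would change only the assignment step, although the mean-based centroid update would then need to be reconsidered.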