...
Published in: Journal of computer sciences

COMPARATIVE STUDY OF K-MEANS AND K-MEANS++ CLUSTERING ALGORITHMS ON CRIME DOMAIN | Science Publications


Abstract

> This study presents the results of an experimental comparison of two document clustering techniques, k-means and k-means++, applied to crime documents. A drawback of k-means is that the user must define the initial centroid points. This is especially critical in document clustering, because each center point is represented by a word and computing the distance between words is not a trivial task. k-means++ was introduced to overcome this problem by finding good initial center points. Since k-means++ has not previously been applied to crime document clustering, this study compares k-means and k-means++ to investigate whether the initialization process in k-means++ yields better results than k-means. We propose using the k-means++ clustering algorithm to identify the best seeds for the initial cluster centers when clustering crime documents. The method includes a pre-processing phase, which involves tokenization, stop-word removal, and stemming. In addition, we evaluate the impact of two similarity/distance measures (cosine similarity and the Jaccard coefficient) on the results of the two clustering algorithms. Experimental results on several settings of the crime data set show that, by identifying the best seeds for the initial cluster centers, k-means++ performs significantly better than k-means (at the 95% confidence level). These results demonstrate the accuracy of the k-means++ clustering algorithm in clustering crime documents.
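The two similarity measures and the k-means++ seeding step mentioned in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy documents, and the use of `1 - similarity` as the distance are all assumptions made for the example, and the documents are assumed to be already tokenized (stop-word removal and stemming are omitted).

```python
import math
import random
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a: Counter, b: Counter) -> float:
    """Jaccard coefficient computed on the sets of distinct terms."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def kmeanspp_seeds(docs, k, similarity, rng):
    """Return the indices of k seed documents.

    The first seed is drawn uniformly. Each later seed is drawn with
    probability proportional to the squared distance (assumed here to be
    1 - similarity) between a document and its nearest existing seed.
    """
    if not 0 < k <= len(docs):
        raise ValueError("k must be between 1 and the number of documents")
    seeds = [rng.randrange(len(docs))]
    while len(seeds) < k:
        weights = [
            0.0 if i in seeds
            else min(1.0 - similarity(d, docs[s]) for s in seeds) ** 2
            for i, d in enumerate(docs)
        ]
        if sum(weights) == 0:
            # Every remaining document coincides with a seed: pick any of them.
            remaining = [i for i in range(len(docs)) if i not in seeds]
            seeds.append(rng.choice(remaining))
        else:
            seeds.append(rng.choices(range(len(docs)), weights=weights, k=1)[0])
    return seeds


# Hypothetical toy corpus of pre-tokenized crime reports.
docs = [
    Counter("burglary house night".split()),
    Counter("burglary house window".split()),
    Counter("fraud bank card".split()),
    Counter("fraud bank account".split()),
]
seeds = kmeanspp_seeds(docs, k=2, similarity=cosine_similarity, rng=random.Random(0))
```

Passing `jaccard_coefficient` instead of `cosine_similarity` switches the measure without changing the seeding logic, which is one simple way to compare the two measures under the same initialization.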
