Venue: IEEE International Conference on Industrial and Information Systems

An improved K-means algorithm using modified cosine distance measure for document clustering using Mahout with Hadoop


Abstract

In this paper, we propose a novel K-means algorithm with a modified cosine distance measure for clustering large datasets such as the latest Wikipedia articles and the Reuters dataset. We customize the cosine distance measure used to compute similarity between objects in order to improve cluster quality. Our method first calculates the cosine distance between two objects. If that distance lies between 0 and 0.5, the method squares it, which brings the objects closer together; otherwise it increases the distance. The result is a smaller intra-cluster distance and a larger inter-cluster distance. We measure cluster quality in terms of inter-cluster and intra-cluster distances, feature weighting such as TF-IDF, cluster size, and the top terms of the clusters. We compare K-means with the standard cosine distance measure against K-means with the modified measure, using performance metrics such as inter-cluster and intra-cluster distances, cluster size, and execution time. Our experimental results show a 0.016% reduction in intra-cluster distance, a 0.012% increase in inter-cluster distance, a 1.5% reduction in cluster size, and a 4% reduction in sequence file size, which together indicate good cluster quality.
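The distance adjustment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Mahout implementation. The abstract does not say how distances above 0.5 are increased, so the square root used here is purely an assumption chosen for illustration, as are the function names, the inclusive 0.5 boundary, and the handling of zero vectors.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(a, b) for two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as maximally dissimilar.
        return 1.0
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return 1.0 - cos


def modified_cosine_distance(a, b, threshold=0.5):
    """Shrink small cosine distances and enlarge larger ones.

    Distances in [0, threshold] are squared, as the abstract describes.
    ASSUMPTION: the abstract only says larger distances are "increased";
    the square root below is a hypothetical choice that increases any
    value strictly between 0.5 and 1. Values of 1 or more are returned
    unchanged because a square root would not increase them.
    """
    d = cosine_distance(a, b)
    if d <= threshold:
        return d * d
    if d < 1.0:
        return math.sqrt(d)
    return d


if __name__ == "__main__":
    near = ([1.0, 0.0], [1.0, 1.0])   # cosine distance below 0.5
    far = ([1.0, 0.0], [1.0, 3.0])    # cosine distance above 0.5
    for a, b in (near, far):
        print(cosine_distance(a, b), modified_cosine_distance(a, b))
```

For non-negative TF-IDF vectors the cosine distance stays in [0, 1], so under these assumptions near pairs move closer and far pairs move farther apart, which is the effect the abstract attributes to the modified measure.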
