An improved K-means algorithm using modified cosine distance measure for document clustering using Mahout with Hadoop

Abstract

In this paper, we propose a novel K-means algorithm with a modified Cosine Distance Measure for clustering large datasets such as the latest Wikipedia articles and the Reuters dataset. We customize the Cosine Distance Measure used to compute similarity between objects in order to improve cluster quality. Our method computes the cosine distance between two objects and then brings them closer by squaring the distance if it lies between 0 and 0.5; otherwise it increases the distance. This lowers the intra-cluster distance and raises the inter-cluster distance. We measure cluster quality in terms of inter-cluster and intra-cluster distances, feature weighting such as TF-IDF, cluster size, and the top terms of the clusters. We compare K-means with the Cosine Distance Measure against K-means with the modified Cosine Distance Measure on performance metrics such as inter-cluster and intra-cluster distance, cluster size, and execution time. Our experimental results show a 0.016% reduction in intra-cluster distance, a 0.012% increase in inter-cluster distance, a 1.5% reduction in cluster size, and a 4% reduction in sequence file size, which indicates improved cluster quality.
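The distance adjustment described in the abstract can be sketched roughly as follows. This is a minimal Python illustration, not the authors' Mahout implementation. The abstract states only that distances between 0 and 0.5 are squared and that larger distances are increased; it does not say how. The square root used for the second case is purely an assumption chosen for illustration, and the function names and the `threshold` parameter are hypothetical. The sketch also assumes non-negative feature vectors (such as TF-IDF weights), so the cosine distance lies in [0, 1].

```python
import math


def cosine_distance(a, b):
    """Plain cosine distance, 1 - cos(a, b), clamped to [0, 1].

    Assumes non-negative vectors (e.g. TF-IDF weights); the clamp only
    absorbs floating-point round-off at the ends of that range.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here: an empty vector is maximally distant
    return min(1.0, max(0.0, 1.0 - dot / (norm_a * norm_b)))


def modified_cosine_distance(a, b, threshold=0.5):
    """Hypothetical sketch of the modified measure.

    Distances in [0, threshold] are squared, which shrinks them.
    Larger distances are increased; the square root is an ASSUMPTION,
    since the abstract does not specify the increase.
    """
    d = cosine_distance(a, b)
    if d <= threshold:
        return d * d
    return math.sqrt(d)  # assumed form of "increase"


if __name__ == "__main__":
    near = modified_cosine_distance([1.0, 3.0], [3.0, 1.0])  # plain distance 0.4
    far = modified_cosine_distance([1.0, 0.0], [1.0, 3.0])   # plain distance about 0.68
    print(near, far)
```

With this choice, a pair at plain distance 0.4 moves closer and a pair at plain distance of about 0.68 moves further apart, which matches the qualitative behaviour the abstract describes.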

Bibliographic record

Source
IEEE International Conference on Industrial and Information Systems | 2014 | pp. 1-5 | 5 pages
Authors
Sahu Lokesh; Mohan Biju R.
Original format: PDF
Keywords
document handling; parallel processing; pattern clustering; Hadoop; Mahout; Reuters dataset; TF-IDF; Wikipedia; cluster quality improvement; cluster quality measurement; cluster size reduction; document clustering; execution time; feature weighting; intercluster distance value maximization; k-means algorithm; minimum intracluster distance value; modified cosine distance measure; object similarity analysis; performance metric; sequence file size reduction; Algorithm design and analysis; Clustering algorithms; Encyclopedias; Internet; Size measurement; Time measurement; Vectors; Document Clustering; Hadoop; K-means; Mahout;

