Scientific Research and Essays

An efficient hybrid distributed document clustering algorithm

Abstract

Recent advances in information technology have pushed data volumes beyond the petabyte scale. Clustering distributed document sets from a central location is difficult because of the massive demand on computational resources, so distributed document clustering algorithms are needed to cluster documents using distributed resources. The greatest challenges in distributed document clustering are clustering quality and speedup as document sets grow. The proposed algorithm, PKMeansLSI, is a hybrid of Particle Swarm Optimization (PSO), K-Means clustering, and Latent Semantic Indexing (LSI), and it uses the MapReduce framework for distributed computation. This hybrid combination improves the clustering quality of the algorithm. The MapReduce framework and its implementation, Hadoop, serve as the distributed programming model and are aimed at improving the speedup of the algorithm. Execution time is dramatically reduced as the dimensionality of the documents is reduced. Experimental results show improved quality and effectiveness of the hybrid algorithm across varying increases in document size.
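
To make the MapReduce formulation concrete, the following is a minimal single-process sketch of one K-Means iteration written as a map step and a reduce step. It is not the authors' PKMeansLSI implementation: the PSO and LSI stages are omitted, Hadoop is replaced by plain Python loops, and the function names (map_assign, reduce_update, kmeans_iteration), the cosine distance, and the toy vectors are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def map_assign(doc, centroids):
    """Map step (hypothetical name): emit (nearest centroid index, (vector, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: cosine_distance(doc, centroids[i]))
    return nearest, (doc, 1)

def reduce_update(values):
    """Reduce step (hypothetical name): average the vectors emitted under one key."""
    total = [0.0] * len(values[0][0])
    count = 0
    for vec, n in values:
        total = [t + v for t, v in zip(total, vec)]
        count += n
    return [t / count for t in total]

def kmeans_iteration(docs, centroids):
    """One K-Means round: map every document, group by key, reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        key, value = map_assign(doc, centroids)
        groups[key].append(value)
    # A centroid that attracted no documents is left unchanged.
    return [reduce_update(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]

# Toy document vectors (assumed data, standing in for reduced-dimension documents).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids = [[1.0, 0.2], [0.2, 1.0]]
for _ in range(5):
    centroids = kmeans_iteration(docs, centroids)
```

In a real Hadoop job the grouping by key would be done by the shuffle phase, and each map call would run on the node holding that document. In the hybrid described above, the initial centroids could plausibly come from a PSO search and the vectors from an LSI projection, but how the paper wires those stages together is not stated in this abstract.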
