Conference: Parallel and Distributed Computing, Applications and Technologies, 2009

An Efficient Hierarchical Clustering Method for Large Datasets with Map-Reduce


Abstract

Large datasets have become common in applications such as Internet services, genomic sequence analysis and astronomical telescopes. Their demanding memory and computation requirements force data mining algorithms to be parallelized in order to process them efficiently. This paper describes our experience of grouping Internet users by mining Web access logs of up to 100 gigabytes. The application is built on hierarchical clustering algorithms running on Map-Reduce, a parallel processing framework for clusters. However, a straightforward implementation of these algorithms is inefficient: it runs short of memory and takes a long time to execute. This paper presents an efficient hierarchical clustering method for mining large datasets with Map-Reduce. The method includes two optimization techniques: "Batch Updating", which reduces the computational time and the communication costs among cluster nodes, and "Co-occurrence based feature selection", which decreases the dimension of the feature vectors and eliminates noise features. The empirical study shows that the first technique significantly reduces the IO and distributed communication overhead, cutting the total execution time to nearly 1/15 of its original value. In the experiments, the second technique effectively simplifies the features while also improving the accuracy of the hierarchical clustering.
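
The abstract does not spell out how "Batch Updating" works, so the following is only a minimal single-machine sketch of the general idea under stated assumptions, not the authors' method. It assumes that each round of agglomerative merging corresponds to one pass over the data (one Map-Reduce job in a distributed setting), and that merging several disjoint cluster pairs per round lowers the number of passes. The function names `jaccard` and `cluster`, the Jaccard similarity, the union-of-features cluster representation, and the toy `users` data are all hypothetical choices made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets (an assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(users, threshold, batch_size):
    """Agglomerative clustering that merges up to `batch_size` disjoint
    cluster pairs per round. Returns (sorted member lists, rounds used)."""
    # Each cluster is (member names, union of the members' features).
    clusters = [({name}, set(feats)) for name, feats in users.items()]
    rounds = 0
    while len(clusters) > 1:
        # Score every pair of clusters, most similar first.
        scored = sorted(
            ((jaccard(clusters[i][1], clusters[j][1]), i, j)
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: (-t[0], t[1], t[2]),
        )
        used, merges = set(), []
        for sim, i, j in scored:
            if sim < threshold or len(merges) == batch_size:
                break
            if i in used or j in used:
                continue  # a cluster may take part in one merge per round
            used.update((i, j))
            merges.append((i, j))
        if not merges:
            break  # no remaining pair is similar enough
        merged = [
            (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
            for i, j in merges
        ]
        clusters = [c for k, c in enumerate(clusters) if k not in used] + merged
        rounds += 1
    return sorted(sorted(members) for members, _ in clusters), rounds


# Toy data (hypothetical): each user is described by the sites visited.
users = {
    "u1": {"news", "mail"},
    "u2": {"news", "mail"},
    "u3": {"games", "video"},
    "u4": {"games", "video"},
}
one_at_a_time = cluster(users, threshold=0.5, batch_size=1)
batched = cluster(users, threshold=0.5, batch_size=2)
```

On this toy input both settings produce the same two groups, but the batched run needs one merging round instead of two. In a distributed setting each saved round would avoid one pass of IO and inter-node communication, which is the kind of overhead the abstract says the first technique reduces.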
