Conference: 2012 IEEE 4th International Conference on Cloud Computing Technology and Science

Cluster analysis for the cloud: Parallel Competitive Fitness and parallel K-means++ for large dataset analysis


Abstract

The amount of resources needed to provision Virtual Machines (VMs) in a cloud computing system to support virtual HPC clusters can be predicted from the analysis of historic use data. In previous work, Hacker et al. found that cluster analysis is a useful tool for understanding the underlying spatio-temporal dependencies present in system fault and use logs. However, the cluster analysis used for reducing spatio-temporal dependencies must be fast and accurate if it is to reveal the underlying stochastic properties of these systems. K-means is a fast cluster analysis method whose accuracy depends on initialization algorithms that are usually serial and slow. In this paper we present two new parallel strategies for fast seeding of K-means cluster analysis. Both strategies were tested on a real problem where the aim was to reduce the spatial and temporal dependencies of failures on large supercomputer systems. The performance of both strategies was compared with five existing serial implementations: the K-means implementations of 1) Lloyd (L); 2) MacQueen (M); and 3) Hartigan-Wong (HW), all of them using Forgy seeding; 4) K-means++; and 5) Neural Gas clustering (NG), a more recent and sophisticated method. Our results show that our new Parallel Competitive Fitness approach reduces the Within Sum of Squares (WSQQ) measure, thus increasing the cluster quality of the three K-means implementations (L, M, and HW), and is 200 times faster than the existing serial K-means++. The existing serial and our new Parallel K-means++ have the lowest WSQQ. Our new Parallel K-means++ is twice as fast as the existing serial K-means++ method and 4 times faster than the NG method. Moreover, our new methods did not generate empty clusters, while NG did. As a result of our new techniques, predicting the amount of resources needed to provision VMs by processing historic system fault and use data can now be done faster and with more accuracy.
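For readers unfamiliar with the baseline the abstract compares against, the following is a minimal sketch of the standard serial K-means++ seeding step, a few Lloyd iterations, and the within-cluster sum of squares used as the quality measure. It is not the authors' Parallel Competitive Fitness or Parallel K-means++ method, which the abstract does not describe in detail. The function names (`kmeanspp_seeds`, `lloyd`, `wss`) and the synthetic three-blob data are assumptions chosen for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng):
    """Pick k initial centers with standard serial K-means++ (D^2) seeding."""
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a center: fall back to uniform choice.
            nxt = rng.choice(points)
        else:
            # Sample the next center with probability proportional to d2.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centers


def wss(points, centers):
    """Within-cluster sum of squares, assigning each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def lloyd(points, centers, iters=20):
    """Run a fixed number of Lloyd iterations starting from the given centers."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            groups[j].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


# Synthetic data (an assumption for illustration): three 2-D Gaussian blobs.
rng = random.Random(0)
blobs = [(0.0, 0.0), (10.0, 10.0), (-10.0, 10.0)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in blobs
    for _ in range(50)
]

seeds = kmeanspp_seeds(points, 3, rng)
final = lloyd(points, seeds)
print("WSS at seeding:", wss(points, seeds))
print("WSS after Lloyd:", wss(points, final))
```

The seeding loop is inherently sequential because each new center depends on the distances to all previously chosen centers, which is the serial cost the abstract refers to. A lower within-cluster sum of squares after the Lloyd iterations indicates tighter clusters.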
