Conference: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Addressing the Challenges of Executing a Massive Computational Cluster in the Cloud

Abstract

A major limitation for time-to-science can be the lack of available computing resources. Depending on the capacity of resources, executing an application suite with hundreds of thousands of jobs can take weeks when resources are in high demand. We describe how we dynamically provision a large-scale high-performance computing cluster of more than one million cores utilizing Amazon Web Services (AWS). We discuss the trade-offs, challenges, and solutions associated with creating such a large-scale cluster with commercial cloud resources. We utilize our large-scale cluster to study a parameter sweep workflow composed of message-passing parallel topic modeling jobs on multiple datasets. At peak, we achieve a simultaneous core count of 1,119,196 vCPUs across nearly 50,000 instances, and are able to execute almost half a million jobs within two hours utilizing AWS Spot Instances in a single AWS region. Our solutions to the challenges and trade-offs have broad application to the lifecycle management of similar clusters on other commercial clouds.
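As a rough sanity check on the scale reported in the abstract, the sketch below derives the average instance size and the sustained job rate implied by the stated figures. The abstract gives the instance and job counts only approximately ("nearly 50,000", "almost half a million"), so they are treated here as upper bounds; the constant and function names are illustrative and do not come from the paper.

```python
# Back-of-envelope figures derived from the abstract's numbers.
# Only PEAK_VCPUS is stated exactly; the other counts are rounded in the
# abstract and are treated as upper bounds (an assumption of this sketch).

PEAK_VCPUS = 1_119_196        # "simultaneous core count of 1,119,196 vCPUs"
INSTANCES_UPPER = 50_000      # "nearly 50,000 instances"
JOBS_UPPER = 500_000          # "almost half a million jobs"
WINDOW_SECONDS = 2 * 60 * 60  # "within two hours"


def vcpus_per_instance(vcpus: int, instances: int) -> float:
    """Average number of vCPUs per instance."""
    return vcpus / instances


def jobs_per_second(jobs: int, seconds: int) -> float:
    """Average job completion rate over the whole window."""
    return jobs / seconds


if __name__ == "__main__":
    avg_size = vcpus_per_instance(PEAK_VCPUS, INSTANCES_UPPER)
    rate = jobs_per_second(JOBS_UPPER, WINDOW_SECONDS)
    # Fewer than 50,000 instances means the true average is at least this.
    print(f"average vCPUs per instance: at least about {avg_size:.1f}")
    # Fewer than 500,000 jobs means the true rate is at most this.
    print(f"average job rate: at most about {rate:.1f} jobs/s")
```

Because the instance count is an upper bound, the per-instance average is a lower bound, and because the job count is an upper bound, the job rate is an upper bound. Neither figure says anything about the actual instance types or the job-duration distribution, which the abstract does not give.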
