Addressing the Challenges of Executing a Massive Computational Cluster in the Cloud

Abstract

A major limitation for time-to-science can be the lack of available computing resources. Depending on the capacity of resources, executing an application suite with hundreds of thousands of jobs can take weeks when resources are in high demand. We describe how we dynamically provision a large scale high performance computing cluster of more than one million cores utilizing Amazon Web Services (AWS). We discuss the trade-offs, challenges, and solutions associated with creating such a large scale cluster with commercial cloud resources. We utilize our large scale cluster to study a parameter sweep workflow composed of message-passing parallel topic modeling jobs on multiple datasets. At peak, we achieve a simultaneous core count of 1,119,196 vCPUs across nearly 50,000 instances, and are able to execute almost half a million jobs within two hours utilizing AWS Spot Instances in a single AWS region. Our solutions to the challenges and trade-offs have broad application to the lifecycle management of similar clusters on other commercial clouds.
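As a rough sanity check on the scale described in the abstract, the sketch below derives two averages from its headline figures. The peak vCPU count is stated exactly, but the instance and job counts are rounded in the abstract ("nearly 50,000", "almost half a million"), so the results are order-of-magnitude estimates rather than numbers reported by the authors. The constant and function names are illustrative.

```python
# Back-of-envelope averages from the abstract's headline numbers.
# Only PEAK_VCPUS is exact; the other counts are rounded upper bounds,
# so treat the outputs as rough estimates, not reported results.

PEAK_VCPUS = 1_119_196        # stated exactly in the abstract
APPROX_INSTANCES = 50_000     # assumption: "nearly 50,000 instances"
APPROX_JOBS = 500_000         # assumption: "almost half a million jobs"
WINDOW_SECONDS = 2 * 60 * 60  # "within two hours"


def average_vcpus_per_instance(vcpus: int, instances: int) -> float:
    """Mean vCPUs per instance at peak."""
    return vcpus / instances


def average_jobs_per_second(jobs: int, seconds: int) -> float:
    """Mean job completion rate over the whole window."""
    return jobs / seconds


if __name__ == "__main__":
    per_instance = average_vcpus_per_instance(PEAK_VCPUS, APPROX_INSTANCES)
    per_second = average_jobs_per_second(APPROX_JOBS, WINDOW_SECONDS)
    print(f"vCPUs per instance (approx.): {per_instance:.1f}")
    print(f"jobs per second (approx.):    {per_second:.1f}")
```

Because the true instance count is somewhat below 50,000, the per-instance figure is a lower bound; because the true job count is somewhat below 500,000, the throughput figure is an upper bound.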

Bibliographic information

Source
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2018 | pp. 253-262 | 10 pages
Authors
Brandon Posey; Christopher Gropp; Boyd Wilson; Boyd McGeachie; Sanjay Padhi; Alexander Herzog; Amy Apon
Original format
PDF
Keywords
Cloud computing; Tools; Pricing; Clouds; Google
