Conference: IEEE International Symposium on Software Reliability Engineering Workshops
A Study of Failures in Community Clusters: The Case of Conte

Abstract

Large community clusters are becoming increasingly common in universities and other organizations because of the benefits they offer researchers in operational costs and resource availability. However, efficient administration, failure diagnosis, and performance debugging on community clusters are challenging because of the sheer diversity of workloads and users. These clusters are typically shared by users from various scientific domains and with varying levels of experience. Many users have little experience in computing and hence often face performance issues, leading to resource wastage. In this paper, we study these dynamics in one of the largest university-wide community clusters (Conte at Purdue University). We perform an in-depth analysis of library and application usage patterns, job failures, and performance issues. Further, we introduce a set of novel analysis techniques that can be used to identify hidden trends and diagnose job failures in compute clusters in general. We provide concrete recommendations for cluster administrators and present case studies highlighting how such information can be used to proactively solve many user issues, ultimately leading to better quality of service.
