ACM Transactions on Software Engineering and Methodology

Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution


Abstract

Many software services today are hosted on cloud computing platforms, such as Amazon EC2, because of benefits such as reduced operational costs. However, node failures in these platforms can impact the availability of their hosted services and potentially lead to large financial losses. Predicting node failures before they actually occur is crucial, as it enables DevOps engineers to minimize their impact by performing preventative actions. However, such predictions are hard because of challenges such as the enormous size of the monitoring data and the complexity of the failure symptoms. AIOps (Artificial Intelligence for IT Operations), a recently introduced approach in DevOps, leverages data analytics and machine learning to improve the quality of computing platforms in a cost-effective manner. However, the successful adoption of such AIOps solutions requires much more than a top-performing machine learning model. Instead, AIOps solutions must be trustable, interpretable, maintainable, scalable, and evaluated in context. To cope with these challenges, in this article we report our process of building an AIOps solution for predicting node failures for an ultra-large-scale cloud computing platform at Alibaba. We expect our experiences to be of value to researchers and practitioners who are interested in building and maintaining AIOps solutions for large-scale cloud computing platforms.
