Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution

YANGGUANG LI; ZHEN MING (JACK) JIANG; HENG LI; AHMED E. HASSAN; CHENG HE; RUIRUI HUANG; ZHENGDA ZENG; MIAN WANG; PINAN CHEN

Abstract

Many software services today are hosted on cloud computing platforms, such as Amazon EC2, because of benefits such as reduced operational costs. However, node failures in these platforms can impact the availability of their hosted services and potentially lead to large financial losses. Predicting node failures before they actually occur is crucial, as it enables DevOps engineers to minimize their impact by performing preventative actions. However, such predictions are hard due to challenges such as the enormous size of the monitoring data and the complexity of the failure symptoms. AIOps (Artificial Intelligence for IT Operations), a recently introduced approach in DevOps, leverages data analytics and machine learning to improve the quality of computing platforms in a cost-effective manner. However, the successful adoption of such AIOps solutions requires much more than a top-performing machine learning model. Instead, AIOps solutions must be trustable, interpretable, maintainable, scalable, and evaluated in context. To cope with these challenges, in this article we report our process of building an AIOps solution for predicting node failures for an ultra-large-scale cloud computing platform at Alibaba. We expect our experiences to be of value to researchers and practitioners who are interested in building and maintaining AIOps solutions for large-scale cloud computing platforms.

Bibliographic record

Source
ACM Transactions on Software Engineering and Methodology | 2020, Issue 2 | 13.1-13.24 | 24 pages
Authors
YANGGUANG LI; ZHEN MING (JACK) JIANG; HENG LI; AHMED E. HASSAN; CHENG HE; RUIRUI HUANG; ZHENGDA ZENG; MIAN WANG; PINAN CHEN
Affiliations

York University;

Queen's University;

Alibaba Group;

Original format: PDF
Language: English
Keywords
AIOps; cloud computing; failure prediction; ultra-large-scale platforms;

Similar literature

1. Cloud Computing and Software Services: Interview with Simon Davies, Solution Architect Specialising in Microsoft Windows' Azure Cloud Platform [J]. UPGRADE: The European Journal for the Informatics Professional. 2010, Issue 4
2. Towards a resource migration method in cloud computing based on node failure rule [J]. Zheng Zhigao, Huang Tao, Zhang Hao. Journal of intelligent & fuzzy systems: Applications in Engineering and Technology. 2016, Issue 5
3. The consensus problem with dual failure nodes in a cloud computing environment [J]. Shun-Sheng Wang, Shu-Ching Wang. Information Sciences: An International Journal. 2014
4. A Solution for Single Point of Failure of Cloud Computing Platform in Electric Power Corporation [C]. Dewen Wang, Xiaomeng Liu. International Forum on Computer and Information Technology. 2014
5. Autonomic Self-Healing in Cloud Computing Platforms = Autonome Selbstheilung in Cloud-Computing-Plattformen [D]. Gulenko, Anton. 2020
6. HealtheDataLab – a cloud computing solution for data science and advanced analytics in healthcare with application to predicting multi-center pediatric readmissions [O]. Louis Ehwerhemuepha, Gary Gasperino, Nathaniel Bischoff. 2020
7. Evaluation of node failures in cloud computing using empirical data [O]. Alwabel Abdulelah, Walters Robert John, Wills Gary. 2014
8. Incentivized Cloud Computing: A Principal Agent Solution to the Cloud Computing Dilemma [R]. 2010
