Improving Failure Tolerance in Large-Scale Cloud Computing Systems


Abstract

Large-scale cloud computing systems have served as the fundamental supporting platform for big data, Internet of Things, and artificial intelligence applications for the past decade. With the scale and complexity of these systems increasing dramatically, various hardware and software failures will inevitably occur and may not be detected and repaired in a timely manner. Besides, sophisticated architectural features of cloud computing may also have an adverse impact on system reliability. In response to these challenges, this paper proposes a simulation-driven framework based on real cloud computing system operation logs for improving failure tolerance in large-scale cloud computing systems. For a given cloud computing system, we first conduct a systematic analysis of its structure and operation characteristics. A Markov-based model is used to examine the system's potential failures, assess their severities, and suggest quick recoveries. During this process, the proposed reliability-aware resource scheduling algorithm is adopted to optimize resources so that the system's reliability can be improved cost-effectively. We also report a case study to demonstrate the application of our algorithm in improving failure tolerance of a large-scale cloud computing system.
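
As a minimal illustration of the kind of Markov-based reliability analysis the abstract refers to, the sketch below computes the steady-state availability of a single node modeled as a two-state (up/down) continuous-time Markov chain, then combines several independent nodes in series. It is not the paper's model or its scheduling algorithm; the function names and the failure and repair rates are hypothetical and chosen only for illustration.

```python
def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Long-run fraction of time a two-state (up/down) Markov node is up.

    The chain leaves "up" at failure_rate and leaves "down" at repair_rate.
    Balancing probability flow gives pi_up * failure_rate == pi_down * repair_rate;
    with pi_up + pi_down == 1 this yields repair_rate / (failure_rate + repair_rate).
    """
    if failure_rate < 0 or repair_rate <= 0:
        raise ValueError("failure_rate must be >= 0 and repair_rate must be > 0")
    return repair_rate / (failure_rate + repair_rate)


def series_availability(availabilities):
    """Availability of a service that needs every independent node to be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    # Hypothetical rates (per hour): one failure per 1000 h, mean repair time 2 h.
    node = steady_state_availability(failure_rate=1 / 1000, repair_rate=1 / 2)
    print(f"single node: {node:.6f}")
    print(f"three nodes in series: {series_availability([node] * 3):.6f}")
```

The series case shows why scale matters: under the independence assumption, each additional required node multiplies in another factor below one, so overall availability falls as the system grows unless recovery is made faster or redundancy is added.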

Bibliographic details

  • Source
    IEEE Transactions on Reliability | 2019, Issue 2 | pp. 620-632 | 13 pages
  • Author affiliations

    Univ Elect Sci & Technol, Sch Comp Sci & Engn, Chengdu 611731, Sichuan, Peoples R China;

    Univ Elect Sci & Technol, Sch Comp Sci & Engn, Chengdu 611731, Sichuan, Peoples R China;

    Univ Elect Sci & Technol, Sch Comp Sci & Engn, Chengdu 611731, Sichuan, Peoples R China;

    Univ Elect Sci & Technol, Sch Comp Sci & Engn, Chengdu 611731, Sichuan, Peoples R China;

  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
  • Original format: PDF
  • Text language: English
  • Keywords

    Cloud computing; failure tolerance; large-scale system; Markov model;

