Reliable computing service in massive-scale systems through rapid low-cost failover


Abstract

Large-scale distributed systems in Cloud datacenters are capable of provisioning services to consumers with diverse business requirements. Faced with significant software and hardware failures, providers are under pressure to provision uninterrupted, reliable services while reducing operational costs. A widely used means to achieve this goal is to use redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully against the overhead it incurs when deployed, an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer from serious limitations, including mandatory restarts that lead to poor cost-effectiveness and a sole focus on crash failures that omits other important types, e.g. timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restarts, with minimal system resource overhead. It also copes with different kinds of failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within the Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst a 228.5-microsecond instance overhead with 1.71% additional CPU usage.
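The abstract does not describe how soft-state inference works in Fuxi, so the following is only a minimal sketch of the general idea, under stated assumptions: a standby master rebuilds its allocation table from reports that worker nodes re-send after a failover, so running instances do not need to be restarted. The names `WorkerReport`, `Master`, `rebuild_from_reports` and `failover` are hypothetical and do not come from the paper or from any Alibaba Cloud API.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerReport:
    """Hypothetical report a worker node re-sends to a newly promoted master."""
    worker_id: str
    running: dict  # instance_id -> CPU cores held by that instance


@dataclass
class Master:
    """Hypothetical resource master whose allocation table is soft state."""
    allocations: dict = field(default_factory=dict)  # instance_id -> (worker_id, cores)

    def rebuild_from_reports(self, reports):
        # Infer the allocation table from what the workers say is running,
        # instead of restarting instances or reading a persisted snapshot.
        self.allocations = {
            instance_id: (report.worker_id, cores)
            for report in reports
            for instance_id, cores in report.running.items()
        }
        return self.allocations


def failover(reports):
    """Promote a standby master and let it infer state from worker reports."""
    standby = Master()
    standby.rebuild_from_reports(reports)
    return standby
```

In this sketch the running instances are never touched during recovery; only the master's view of them is reconstructed. That is the property the abstract attributes to avoiding mandatory restarts.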
