Journal: IEEE Transactions on Services Computing

Reliable Computing Service in Massive-Scale Systems through Rapid Low-Cost Failover


Abstract

Large-scale distributed systems deployed as Cloud datacenters can provision services to consumers with diverse business requirements. Because software and hardware failures are significant, providers face pressure to deliver uninterrupted, reliable service while also reducing operational costs. A widely adopted means of achieving this goal is to use redundant system components to implement user-transparent failover. Its effectiveness, however, must be balanced against the overhead it incurs when deployed, an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer from serious limitations: they require a mandatory restart, which leads to poor cost-effectiveness, and they focus solely on crash failures, omitting other important types such as timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restarts, with minimal system resource overhead. It also copes with different kinds of failure, including correlated and simultaneous events. The proposed approach was implemented, deployed, and evaluated within the Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst 228.5 microseconds of instance overhead and 1.71 percent additional CPU usage.
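
The abstract does not describe how soft-state inference works in Fuxi, so the following is only a generic, minimal sketch of the idea under stated assumptions: a newly promoted master holds no durable state and instead re-derives its task assignment table from what surviving workers report, so running tasks do not need to be restarted. The names `WorkerReport`, `Master`, `rebuild_from_reports`, and `fail_over` are hypothetical and are not taken from the paper or from Fuxi.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerReport:
    """What a surviving worker tells a newly promoted master (hypothetical shape)."""
    worker_id: str
    running_tasks: list[str]


@dataclass
class Master:
    """Holds only soft state: a task -> worker map that can be re-derived."""
    assignments: dict[str, str] = field(default_factory=dict)

    def rebuild_from_reports(self, reports: list[WorkerReport]) -> None:
        # Infer the assignment table from what workers say they are running,
        # instead of restarting the tasks to reach a known state.
        self.assignments = {
            task: report.worker_id
            for report in reports
            for task in report.running_tasks
        }


def fail_over(reports: list[WorkerReport]) -> Master:
    """Promote a standby master and recover its state by inference (assumed flow)."""
    standby = Master()
    standby.rebuild_from_reports(reports)
    return standby


reports = [
    WorkerReport("worker-1", ["task-a", "task-b"]),
    WorkerReport("worker-2", ["task-c"]),
]
new_master = fail_over(reports)
print(new_master.assignments)
```

In this sketch the only recovery cost is collecting and merging the reports. It does not model timing failures or correlated failures, which the paper states its approach also handles.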
