Reliable Computing Service in Massive-Scale Systems through Rapid Low-Cost Failover

Renyu Yang; Yang Zhang; Peter Garraghan; Yihui Feng; Jin Ouyang; Jie Xu; Zhuo Zhang; Chao Li


Abstract

Large-scale distributed systems deployed as Cloud datacenters are capable of provisioning service to consumers with diverse business requirements. Because of significant software and hardware failures, providers face pressure to provision uninterrupted, reliable services while reducing operational costs. A widely adopted means of achieving this goal is to use redundant system components to implement user-transparent failover, yet its effectiveness must be carefully balanced against the overhead it incurs when deployed, an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart, which leads to poor cost-effectiveness, and a sole focus on crash failures that omits other important types such as timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within the Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that the approach tolerates complex failure scenarios while incurring at worst 228.5 microseconds of instance overhead and 1.71 percent additional CPU usage.
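The abstract does not describe how soft-state inference is carried out, so the following is only a generic illustration of the idea under stated assumptions, not the Fuxi implementation. It assumes a hypothetical `Master` that keeps its allocation table purely as soft state, and hypothetical per-node `Agent` objects that can re-report the instances they are running. After a master failure, a standby rebuilds the table from those reports, so running instances do not have to be restarted. All class, field, and function names are invented for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical per-node agent that knows which instances it runs."""
    node: str
    running: dict[str, int] = field(default_factory=dict)  # instance id -> cores

    def report(self) -> dict[str, int]:
        # Soft state: the agent can re-send this snapshot at any time.
        return dict(self.running)


@dataclass
class Master:
    """Hypothetical scheduler master whose allocation table is soft state."""
    capacity: dict[str, int]  # node -> total cores
    allocations: dict[str, dict[str, int]] = field(default_factory=dict)

    def recover(self, agents: list[Agent]) -> None:
        # Infer the allocation table from agent reports; nothing is restarted.
        self.allocations = {agent.node: agent.report() for agent in agents}

    def free_cores(self, node: str) -> int:
        used = sum(self.allocations.get(node, {}).values())
        return self.capacity[node] - used


def failover(capacity: dict[str, int], agents: list[Agent]) -> Master:
    """Promote a fresh standby master and rebuild its state from the agents."""
    standby = Master(capacity)
    standby.recover(agents)
    return standby


if __name__ == "__main__":
    agents = [Agent("n1", {"job-a": 2}), Agent("n2", {"job-b": 3, "job-c": 1})]
    new_master = failover({"n1": 4, "n2": 8}, agents)
    print(new_master.free_cores("n1"), new_master.free_cores("n2"))
```

In this sketch the agents' `running` tables are left untouched by the failover, which is the sense in which restart is avoided; a real system would also need to handle agents that fail or report late during recovery.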


Bibliographic details

Source
Services Computing, IEEE Transactions on | 2017, Issue 6 | pp. 969-983 | 15 pages
Authors
Renyu Yang; Yang Zhang; Peter Garraghan; Yihui Feng; Jin Ouyang; Jie Xu; Zhuo Zhang; Chao Li
Original format: PDF
Language: English
Keywords
Computer crashes; Cloud computing; Fault tolerant systems; Timing; Resource management; Checkpointing; Reliability
