Advances in hardware have enabled many long-running applications to execute entirely in main memory. With the emergence of cloud computing, thousands of machines can be made available to deploy such applications at lower operational and maintenance cost. While achieving substantially better performance, these applications face new challenges in achieving fault tolerance, i.e., ensuring durability in the event of a crash. In addition, many of these applications, such as massively multiplayer online games, main-memory OLTP systems, main-memory search engines, and deterministic transaction processing systems, must sustain extremely high update rates, often hundreds of thousands of updates per second. They also demand extremely high throughput (e.g., scientific simulation) or low latency (e.g., massively multiplayer online games). To support these demanding requirements, these applications have increasingly turned to database techniques.

In this dissertation, we propose an approach to providing fault tolerance for main-memory applications without introducing excessive overhead or latency spikes. First, we evaluate the applicability of existing checkpoint recovery techniques developed for main-memory DBMSs, using massively multiplayer online games (MMOs) as our motivating example. In particular, we show how to adapt consistent checkpointing techniques developed for main-memory databases to MMOs. Furthermore, we provide a thorough simulation model and evaluation of six recovery strategies. Based on our results, we argue that not all state-of-the-art checkpoint recovery techniques are equally suited for low-latency, high-throughput applications such as MMOs: these algorithms either use locks, which hurt throughput, or large synchronous copy operations, which cause latency spikes.
Next, we take advantage of frequent points of consistency in many of these applications to develop novel checkpoint recovery algorithms that trade additional space in main memory for significantly lower overhead and latency. Compared to previous work, our new algorithms do not require any locking or bulk copies of the application state. Our experimental evaluation shows that one of our new algorithms attains nearly constant latency and reduces overhead by more than an order of magnitude for low to medium update rates. Additionally, in a heavily loaded main-memory transaction processing system, it still reduces overhead by more than a factor of two. Finally, we present BRRL, a library for making distributed main-memory applications fault tolerant. BRRL is optimized for cloud applications with frequent points of consistency that use data-parallelism to avoid complex concurrency control mechanisms. BRRL differs from existing recovery libraries by providing a simple table abstraction and using schema information to optimize checkpointing.
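The general idea of trading extra main memory for checkpoints that need no locks or bulk synchronous copies can be illustrated with a simplified double-buffer sketch. This is only an illustration under assumed semantics, not the dissertation's actual algorithms (which differ in how they avoid synchronous copying); the class and method names are hypothetical.

```python
class PingPongCheckpoint:
    """Illustrative sketch: keep two full copies of the application state.
    At a point of consistency, the roles of the copies swap in O(1) instead
    of bulk-copying the state, so the mutator never blocks on a lock while
    a stable snapshot is persisted in the background."""

    def __init__(self, size):
        self.copies = [[0] * size, [0] * size]
        self.live = 0                    # index of the copy receiving updates
        self.dirty = [False] * size      # entries changed since the last swap

    def write(self, key, value):
        # Mutator: update the live copy only and record a dirty bit.
        self.copies[self.live][key] = value
        self.dirty[key] = True

    def read(self, key):
        return self.copies[self.live][key]

    def begin_checkpoint(self):
        # At a point of consistency: swap roles, then bring the new live
        # copy up to date on just the entries modified since its last turn
        # (proportional to the update rate, not to the total state size).
        frozen = self.live
        self.live = 1 - self.live
        for k, was_dirty in enumerate(self.dirty):
            if was_dirty:
                self.copies[self.live][k] = self.copies[frozen][k]
        self.dirty = [False] * len(self.dirty)
        return self.copies[frozen]       # stable snapshot, safe to persist
```

After `begin_checkpoint` returns, the frozen copy is immutable until the next swap, so a background thread can write it to disk while updates proceed on the live copy, which is the source of the near-constant latency reported above.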