...
首页> 外文期刊>Distributed Computing >Fault tolerance in distributed systems using fused state machines
【24h】

Fault tolerance in distributed systems using fused state machines

机译:使用融合状态机的分布式系统中的容错

获取原文
获取原文并翻译 | 示例

摘要

Replication is a standard technique for fault tolerance in distributed systems modeled as deterministic finite state machines (DFSMs or machines). To correct f crash or 「f/2」J Byzantine faults among n different machines, replication requires nf backup machines. We present a solution called fusion that requires just f backup machines. First, we build a framework for fault tolerance in DFSMs based on the notion of Hamming distances. We introduce the concept of an (f , m)-fusion, which is a set of m backup machines that can correct f crash faults or 「f/2」 Byzantine faults among a given set of machines. Second, we present an algorithm to generate an (f, f)-fusion for a given set of machines. We ensure that our backups are efficient in terms of the size of their state and event sets. Third, we use locality sensitive hashing for the detection and correction of faults that incurs almost the same overhead as that for replication. We detect Byzantine faults with time complexity O(nf) on average while we correct crash and Byzantine faults with time complexity O(nρf) with high probability, where ρ is the average state reduction achieved by fusion. Finally, our evaluation of fusion on the widely used MCNC'91 benchmarks for DFSMs shows that the average state space savings in fusion (over replication) is 38 % (range 0-99 %). To demonstrate the practical use of fusion, we describe its potential application to two areas: sensor networks and the MapRe-duce framework. In the case of sensor networks a fusion-based solution can lead to significantly fewer sensor-nodes than a replication-based solution. For the MapReduce framework, fusion can reduce the number of map-tasks compared to replication. Hence, fusion results in considerable savings in state space and other resources such as the power needed to run the backups.
机译:复制是建模为确定性有限状态机(DFSM或机器)的分布式系统中的容错标准技术。要纠正n台不同机器之间的f崩溃或“ f / 2” J拜占庭式错误,复制需要nf备份机器。我们提出了一种称为融合的解决方案,仅需要f台备用计算机。首先,我们基于汉明距离的概念建立了DFSM中的容错框架。我们介绍(f,m)-fusion的概念,它是一组m个备用计算机,可以纠正给定一组计算机中的f个崩溃故障或“ f / 2”拜占庭式故障。其次,我们提出一种算法,用于为给定的一组机器生成(f,f)-融合。我们确保我们的备份在其状态和事件集的大小方面都是有效的。第三,我们使用局部敏感哈希来检测和纠正错误,这些错误所产生的开销几乎与复制所产生的开销相同。我们以平均概率检测时间复杂度为O(nf)的拜占庭式故障,同时以高概率纠正崩溃和时间复杂度为O(nρf)的拜占庭式故障,其中ρ是通过融合实现的平均状态降低。最后,我们在DFSM的广泛使用的MCNC'91基准上对融合的评估显示,融合(过度复制)中平均状态空间节省为38%(范围为0-99%)。为了演示融合的实际应用,我们将其应用于两个领域:传感器网络和MapRe-duce框架。在传感器网络的情况下,与基于复制的解决方案相比,基于融合的解决方案可以导致更少的传感器节点。对于MapReduce框架,与复制相比,融合可以减少映射任务的数量。因此,融合可大大节省状态空间和其他资源,例如运行备份所需的电源。

著录项

  • 来源
    《Distributed Computing 》 |2014年第4期| 287-311| 共25页
  • 作者单位

    EDGE Lab, Department of Electrical Engineering, Princeton University, Engineering Quadrangle, Olden Street, Princeton, NJ 08544, USA;

    Parallel and Distributed Systems Laboratory, Department of Electrical and Computer Engineering, The University of Texas at Austin, 1 University Station, C0803, Austin, TX 78712-0240, USA;

  • 收录信息 美国《科学引文索引》(SCI);美国《工程索引》(EI);
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

    Distributed systems; Fault tolerance; Finite state machines; Coding theory; Hamming distances;

    机译:分布式系统;容错能力有限状态机;编码理论;海明距离;

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号