Heavy Tails and Instabilities in Large-Scale Systems with Failures.

Abstract

Modern engineering systems, e.g., wireless communication networks and distributed computing systems, are characterized by high variability and susceptibility to failures. Failure recovery is required to guarantee the successful operation of these systems. One straightforward and widely used mechanism is to restart the interrupted jobs from the beginning after a failure occurs. In network design, retransmissions are the primary building blocks of the network architecture that guarantee data delivery in the presence of channel failures. Retransmissions have recently been identified as a new origin of power laws in modern information networks. In particular, it was discovered that retransmissions give rise to long tails (delays) and possibly zero throughput. To this end, we investigate the impact of the 'retransmission phenomenon' on the performance of failure-prone systems and propose adaptive solutions to address emerging instabilities.

The preceding finding of power law phenomena due to retransmissions holds under the assumption that data sizes have infinite support. In practice, however, data sizes are bounded, 0 ≤ L ≤ b; e.g., WaveLAN's maximum transfer unit is 1500 bytes, YouTube videos are of limited duration, and e-mail attachments cannot exceed 10MB. To this end, we first provide a uniform characterization of the entire body of the distribution of the number of retransmissions, which can be represented as a product of a power law and the Gamma distribution. This rigorous approximation clearly demonstrates the transition from power law distributions in the main body to exponential tails. Furthermore, the results highlight the importance of wisely determining the size of data fragments in order to accommodate the performance needs in these systems, and they provide the appropriate tools for this fragmentation.

Second, we extend the analysis to the practically important case of correlated channels using modulated processes, e.g., Markov modulated, to capture the underlying dependencies. Our study shows that the tails of the retransmission and delay distributions are asymptotically insensitive to the channel correlations and are determined by the state that generates the lightest tail in the independent channel case. This insight is beneficial both for capacity planning and channel modeling, since the independent model is sufficient and the correlation details do not matter. However, the preceding finding may be overly optimistic when the best state is atypical, since the effects of 'bad' states may still degrade the performance.

Third, we examine the effects of scheduling policies in queueing systems with failures and restarts. Fair sharing, e.g., processor sharing (PS), is a widely accepted approach to resource allocation among multiple users. We revisit the well-studied M/G/1 PS queue with a new focus on server failures and restarts. Interestingly, we discover a new phenomenon showing that PS-based scheduling induces complete instability in the presence of retransmissions, regardless of how low the traffic load may be. This novel phenomenon occurs even when the job sizes are bounded/fragmented, e.g., deterministic. This work demonstrates that scheduling one job at a time, such as first-come-first-serve, achieves a larger stability region and should be preferred in these systems.

Last, we delve into the area of distributed computing and study the effects of commonly used mechanisms, i.e., restarts, fragmentation, and replication, especially in cloud computing services. We evaluate the efficiency of these techniques under different assumptions on the data streams and discuss the corresponding optimization problem. These findings are useful for optimal resource allocation and fault tolerance in rapidly developing computing networks.

In addition to networking and distributed computing systems, the aforementioned results improve our understanding of failure recovery management in large manufacturing and service systems, e.g., call centers. Scalable solutions to this problem increase in significance as these systems continuously grow in scale and complexity. The new phenomena and the techniques developed herein provide new insights in the areas of parallel computing, probability and statistics, as well as financial engineering.
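As a rough illustration of the retransmission phenomenon described in the abstract, the Python sketch below simulates a deliberately simplified model. The model is an assumption made here and is not taken from the thesis: a data unit of size L ~ Exp(mu) is resent until an independently drawn channel availability period A ~ Exp(nu) exceeds L. For this particular choice the number of transmissions N satisfies P(N > n) = Γ(α+1)Γ(n+1)/Γ(n+α+1) with α = mu/nu, which decays like the power law Γ(α+1)·n^(−α). The hypothetical `cap` parameter truncates sizes at an upper bound b, in which case P(N > n) ≤ (1 − e^(−nu·b))^n, so the tail becomes at least geometric. All function and parameter names are illustrative.

```python
import math
import random


def exact_tail(n, alpha):
    """P(N > n) for exponential sizes and availability periods (assumed model).

    Equals Gamma(alpha+1) * Gamma(n+1) / Gamma(n+alpha+1), with alpha = mu / nu.
    Computed through log-gamma to avoid overflow for large n.
    """
    return math.exp(
        math.lgamma(alpha + 1) + math.lgamma(n + 1) - math.lgamma(n + alpha + 1)
    )


def sample_transmissions(rng, mu, nu, cap=None):
    """Draw one data unit and count attempts until it is sent successfully.

    An attempt succeeds when the availability period exceeds the unit's size.
    `cap` (hypothetical) truncates the size at an upper bound b.
    """
    size = rng.expovariate(mu)
    if cap is not None:
        size = min(size, cap)
    attempts = 1
    while rng.expovariate(nu) <= size:  # channel failed before the unit finished
        attempts += 1
    return attempts


def empirical_tail(samples, n):
    """Fraction of samples strictly greater than n."""
    return sum(1 for s in samples if s > n) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, nu = 2.0, 1.0  # assumed rates, so alpha = 2
    unbounded = [sample_transmissions(rng, mu, nu) for _ in range(200_000)]
    bounded = [sample_transmissions(rng, mu, nu, cap=1.0) for _ in range(200_000)]
    print("n  exact  simulated(unbounded)  simulated(bounded)")
    for n in (1, 3, 10, 30):
        print(
            n,
            exact_tail(n, mu / nu),
            empirical_tail(unbounded, n),
            empirical_tail(bounded, n),
        )
```

Comparing the unbounded and bounded columns for growing n gives a feel for the transition from a power-law body to a much lighter tail once sizes are capped; the sketch does not reproduce the uniform approximation derived in the dissertation.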

Bibliographic record

  • Author

    Skiani, Evangelia

  • Author affiliation

    Columbia University

  • Degree-granting institution: Columbia University
  • Subjects: Electrical engineering; Operations research; Computer science
  • Degree: Ph.D.
  • Year: 2015
  • Pages: 148 p.
  • Total pages: 148
  • Original format: PDF
  • Language of text: English