ACM Queue: Architecting Tomorrow's Computing

Building a distributed system requires a methodical approach to requirements.



Abstract

Mark Cavage

Distributed systems are difficult to understand, design, build, and operate. They introduce exponentially more variables into a design than a single machine does, making the root cause of an application problem much harder to discover. It should be said that if an application does not have meaningful SLAs (service-level agreements) and can tolerate extended downtime and/or performance degradation, then the barrier to entry is greatly reduced. Most modern applications, however, carry an expectation of resiliency from their users, and SLAs are typically measured by "the number of nines" (e.g., 99.9 or 99.99 percent availability per month). Each additional 9 becomes harder and harder to achieve. To complicate matters further, distributed failures commonly manifest as intermittent errors or decreased performance (known as brownouts). These failure modes are much more time-consuming to diagnose than a complete failure.

For example, Joyent operates several distributed systems as part of its cloud-computing infrastructure. In one such system, a highly available, distributed key/value store, Joyent recently experienced transient application timeouts. For most users the system operated normally and responded within the bounds of its latency SLA. However, 5-10 percent of requests exceeded a predefined application timeout. The failures were not reproducible in development or test environments, and they would often "go away" for minutes to hours at a time. Troubleshooting this problem to root cause required extensive analysis of the data-storage API (node.js), the RDBMS (relational database management system) used internally by the system (PostgreSQL), the operating system, and the end-user application that relied on the key/value store. Ultimately, the root problem was in application semantics that caused excessive locking, but determining that required considerable data gathering and correlation, and consumed many working hours among engineers with differing areas of expertise.
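The "number of nines" mentioned above maps directly to a monthly downtime budget. A minimal sketch of that arithmetic (a 30-day month is assumed here purely for illustration):

```javascript
// Downtime budget per month implied by an availability SLA of n nines.
// Assumes a 30-day month (43,200 minutes) for illustration.
function downtimeBudgetMinutes(nines) {
  const monthMinutes = 30 * 24 * 60;          // 43,200 minutes in 30 days
  return monthMinutes * Math.pow(10, -nines); // fraction of time allowed down
}
```

Three nines (99.9 percent) allow roughly 43 minutes of downtime per month; four nines shrink that to about 4.3 minutes, which is why each additional 9 is so much harder to achieve.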
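The "predefined application timeout" described in the anecdote is commonly implemented in node.js as a race between the real request and a timer. This is not Joyent's code, only a minimal sketch of the pattern:

```javascript
// Wrap a request promise with a client-side deadline: the result either
// resolves within `ms` milliseconds or rejects with a timeout error that
// the caller can count toward its SLA metrics.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('request timed out')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Counting how often such rejections fire is one way the "5-10 percent of requests exceeded a predefined application timeout" figure above could have been observed.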
