
Data-driven approaches to improve dependability of cloud services.



Abstract

The growing demand for always-on and low-latency cloud services is driving the creation of globally distributed datacenters. A major factor affecting service availability is the reliability of the network, both inside the datacenters and on the wide-area links connecting them. While several research efforts focus on building scale-out datacenter networks, little has been reported on real network failures and how they impact geo-distributed services. Towards improving the dependability of the underlying datacenter networks, in this dissertation we make one of the first attempts to characterize intra-datacenter and inter-datacenter network failures from a service perspective. Specifically, we make the following contributions:

1. Analysis Methodology for Structured Data: Our dataset includes multiple sources of structured network telemetry data spanning three years, logged in the monitoring servers of a large cloud provider comprising 100k+ servers, 10k+ core network devices, 2k+ middleboxes and 100k+ network links across 10+ datacenters. This dataset covers a wide range of network data sources, including syslog and SNMP alerts, and the traffic carried by links. We describe a systematic methodology, based on event processing, for analyzing this structured data to extract events that have service-level impact.

2. Analysis Methodology for Unstructured Data: Our dataset also includes an important piece of operational knowledge: network trouble tickets, which are diaries written by network operators to keep track of their troubleshooting efforts while fixing a problem. We take a practical step towards automatically analyzing the natural language text in network trouble tickets to infer the problem symptoms, troubleshooting activities and resolution actions. Our system, NetSieve, combines statistical natural language processing (NLP), knowledge representation, and ontology modeling to achieve these goals.

3. Data-Driven Approaches to Deriving Actionable Insights: Our overarching goal in this dissertation is to enable operators to understand global problem trends instead of making decisions based on isolated incidents. We outline several analyses rooted in reliability analysis and applied statistics for characterizing network failures and deriving actionable insights from them. Our study reveals several important findings on (a) the failure characteristics of network elements, (b) the availability of network domains, (c) service impact, (d) causes of network failures, (e) effectiveness of repairs, and (f) modeling failures.

As part of this dissertation, we have built a broad range of systems including real-time network dashboards, a big data analytics system for analyzing network telemetry data, and an inference tool for root cause analysis in network troubleshooting. Several components of the dissertation work have either undergone a tech transfer or are being used by multiple business groups inside Microsoft. NetWiser, a Microsoft Research project that includes this dissertation work, was awarded the Microsoft Trustworthy Computing Reliability Award for 2013.

The problem inference system developed in this dissertation, NetSieve, is currently being used across different teams within Microsoft to improve network management: by the Network Architecture team for comparing device reliability across platforms and vendors, by the Capacity Planning team for understanding why network redundancy is ineffective in masking failures, and by the Incident Management and Operations team for finding the top-k problems and failing components while troubleshooting devices and for determining whether past repairs were effective. Since its inception, NetSieve has also been used to automate root cause analysis of security incidents within Microsoft's datacenters and has recently found its way into commercial use through Microsoft's System Center Advisor (http://www.systemcenteradvisor.com).

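Bibliographic record

  • Author: Potharaju, Rahul
  • Author affiliation: Purdue University
  • Degree-granting institution: Purdue University
  • Subjects: Computer science; Information technology; Information science
  • Degree: Ph.D.
  • Year: 2014
  • Pages: 158 p.
  • Total pages: 158
  • Original format: PDF
  • Language of text: English (eng)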
