Published in: IEEE International Parallel and Distributed Processing Symposium

Communication-Efficient Distributed Variance Monitoring and Outlier Detection for Multivariate Time Series

Abstract

Modern scale-out services consist of thousands of individual machines, which must be continuously monitored for unexpected failures. One recent approach to monitoring is latent fault detection, an adaptive statistical framework for scale-out, load-balanced systems. By periodically measuring hundreds of performance metrics and looking for outlier machines, it attempts to detect subtle problems such as misconfigurations, bugs, and malfunctioning hardware before they manifest as machine failures. Previous work on a large, real-world Web service has shown that many failures are indeed preceded by such latent faults. However, latent fault detection is an offline framework with large bandwidth and processing requirements: each machine must send all its measurements to a centralized location, which is prohibitive in some settings and requires data-parallel processing infrastructure. In this work we adapt the latent fault detector to provide an online version with reduced communication and computation. We use stream processing techniques to trade some accuracy for reduced communication and computation. We first describe a novel communication-efficient online distributed variance monitoring algorithm that provides a continuous estimate of the global variance within guaranteed approximation bounds. Using the variance monitor, we provide an online distributed outlier detection framework for the non-stationary multivariate time series common in scale-out systems. The adapted framework reduces data size and central processing cost by processing the data in situ, making it usable in a wider range of settings. Like the original framework, our adaptation admits different comparison functions, supports non-stationary data, and provides statistical guarantees on the rate of false positives. Simulations on logs from a production system show that we are able to reduce bandwidth by an order of magnitude, with less than 1% error compared to the original algorithm.
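
The abstract does not spell out the variance monitoring algorithm itself. As a rough illustration of why processing data in situ can reduce bandwidth, the Python sketch below uses a standard mergeable variance summary (Welford's online update combined with the pairwise merge of Chan et al.). This is a swapped-in textbook technique, not the paper's method: each machine sends three numbers instead of all of its raw measurements. The names `Summary`, `summarize`, and `global_variance` are hypothetical. The sketch computes the exact variance on demand and does not provide the continuous estimate within approximation bounds that the abstract describes.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class Summary:
    """Mergeable summary of one machine's measurements (hypothetical name)."""
    n: int = 0          # number of samples seen
    mean: float = 0.0   # running mean
    m2: float = 0.0     # sum of squared deviations from the running mean

    def add(self, x: float) -> None:
        # Welford's online update: one pass over the data, O(1) state.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Summary") -> "Summary":
        # Pairwise combination of two summaries (Chan et al.).
        n = self.n + other.n
        if n == 0:
            return Summary()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    def variance(self) -> float:
        # Population variance; defined as 0.0 for an empty summary.
        return self.m2 / self.n if self.n else 0.0


def summarize(values: Iterable[float]) -> Summary:
    """Run locally on each machine; only the Summary leaves the machine."""
    s = Summary()
    for v in values:
        s.add(v)
    return s


def global_variance(per_machine: Sequence[Sequence[float]]) -> float:
    """Run at the coordinator: merge per-machine summaries (assumed layout)."""
    total = Summary()
    for values in per_machine:
        total = total.merge(summarize(values))
    return total.variance()
```

In this sketch the communication cost per machine is constant regardless of how many measurements it collects, which is the general idea behind replacing raw-data shipping with local summaries.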
