Failure Order: A Missing Piece in Disk Failure Processing of Data Centers

Abstract

To avoid data loss, data centers adopt disk failure prediction (DFP) technology to raise warnings ahead of actual disk failures, and they process the warnings in the order in which they are raised, i.e., a first-in-first-out (FIFO) warning order. The FIFO-guided order handles warnings in a timely manner when disk failures are rare. As data centers grow in scale, however, the increasing number of disk failures creates a complex situation in which multiple warnings are raised simultaneously. In this situation the FIFO-guided order neither processes warnings in time nor manages them properly, because it assigns them no priority. Real-time, finer-grained priority guidance for warning order management is therefore urgently needed. To this end, we turn our attention to the failures themselves, since each warning corresponds to a failure event. The key insight is that the interdependence of failures, i.e., the order in which the failures occur, indicates the order in which the warnings should be processed. With an accurate failure order, data centers can reduce the probability of data loss and the downtime of latency-sensitive applications by processing urgent warnings first. In this paper, we predict the failure order with a LambdaMART model, a state-of-the-art ranking algorithm in information retrieval. Because evaluation metrics in information retrieval place excessive weight on the correctness of the top-ranked items, we design a symmetric metric to evaluate the predicted failure order. Experiments on a public dataset provided by Backblaze show that our model outperforms both the FIFO order and the order derived from previous DFP models.
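The contrast between a FIFO warning order and a failure-order-guided one can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the disk names, raise times, and days-to-failure values are invented, the "predicted" column stands in for the output of any ranking model (it is not produced by LambdaMART here), and Kendall's tau is used only as a familiar rank-agreement measure that weights every pair of positions equally. It is not the symmetric metric designed in the paper, which the abstract does not define.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Kendall rank correlation between two orderings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # A positive product means both orderings agree on which of x, y comes first.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical warnings:
# (disk id, time the warning was raised, predicted days to failure, actual days to failure)
warnings = [
    ("disk-a", 1, 9.0, 10),
    ("disk-b", 2, 2.5, 2),
    ("disk-c", 3, 6.0, 7),
    ("disk-d", 4, 1.0, 3),
]

# FIFO: process warnings in the order they were raised.
fifo_order = [w[0] for w in sorted(warnings, key=lambda w: w[1])]
# Priority-guided: process the disk predicted to fail soonest first.
predicted_order = [w[0] for w in sorted(warnings, key=lambda w: w[2])]
# Ground truth: the order in which the disks actually fail.
true_order = [w[0] for w in sorted(warnings, key=lambda w: w[3])]

print("FIFO vs. actual failure order:     ", kendall_tau(fifo_order, true_order))
print("Predicted vs. actual failure order:", kendall_tau(predicted_order, true_order))
```

In this toy data the FIFO order disagrees with the actual failure order on most pairs of disks, while the predicted order misplaces only one pair. Unlike top-weighted information-retrieval metrics, a pairwise measure of this kind penalizes a swap near the bottom of the ranking as much as one near the top.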