Journal: IEICE Transactions on Information and Systems

Reliability and Failure Impact Analysis of Distributed Storage Systems with Dynamic Refuging



Abstract

In recent data centers, large-scale storage systems that store big data comprise thousands of large-capacity drives. Our goal is to establish a method for building highly reliable storage systems from more than a thousand low-cost, large-capacity drives. Some large-scale storage systems protect data with erasure coding to prevent data loss. Raising the redundancy level of erasure coding lowers the probability of data loss, but it also increases the overhead of normal write operations and the additional storage needed for coding. We therefore need to achieve high reliability at the lowest possible redundancy level. There are two reliability concerns in large-scale storage systems: (i) as the number of drives increases, systems become more susceptible to multiple drive failures, and (ii) distributing stripes among many drives shortens the rebuild time but increases the risk of data loss due to multiple drive failures. If data loss is caused by multiple drive failures, it affects many users of the storage system. Prior quantitative reliability studies based on realistic settings did not address these concerns. In this work, we analyze the reliability of large-scale storage systems with distributed stripes, focusing on an effective rebuild method that we call Dynamic Refuging. Dynamic Refuging rebuilds failed blocks starting from those with the lowest redundancy and strategically selects the blocks to read when repairing lost data. We modeled how the amount of storage at each redundancy level changes dynamically under multiple drive failures, and performed a reliability analysis by Monte Carlo simulation using realistic drive failure characteristics. We also present a failure impact model and a method for localizing failures. When stripes with redundancy level 3 were sufficiently distributed and rebuilt by Dynamic Refuging, the proposed technique scaled well, and the probability of data loss decreased by two orders of magnitude for systems with a thousand drives compared to normal RAID. An appropriate setting of the stripe distribution level could localize the failure.
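
The rebuild ordering mentioned in the abstract (repair the blocks with the lowest redundancy first) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `rebuild_order`, the stripe-to-drive mapping, and the definition of "remaining redundancy" as the redundancy level minus the number of failed blocks in a stripe are all assumptions made for illustration. The sketch covers only the prioritization step, not the strategic selection of blocks to read or the Monte Carlo reliability analysis.

```python
def rebuild_order(stripes, failed_drives, redundancy=3):
    """Order degraded stripes for rebuild, lowest remaining redundancy first.

    stripes:       dict mapping a stripe id to the set of drive ids that
                   hold its blocks (hypothetical layout, for illustration).
    failed_drives: set of drive ids that have failed.
    redundancy:    number of block losses a stripe can tolerate (assumed).

    Returns (order, lost): stripe ids to rebuild in priority order, and
    stripe ids that have already lost more blocks than they can tolerate.
    """
    degraded = []
    lost = []
    for stripe_id, drives in stripes.items():
        failed_blocks = len(drives & failed_drives)
        if failed_blocks == 0:
            continue  # stripe is intact, nothing to rebuild
        remaining = redundancy - failed_blocks
        if remaining < 0:
            lost.append(stripe_id)  # beyond repair under this model
        else:
            degraded.append((remaining, stripe_id))
    # Stripes closest to data loss come first; ties are broken by stripe id.
    degraded.sort()
    return [stripe_id for _, stripe_id in degraded], lost
```

For example, with three failed drives, a stripe that has lost three blocks (remaining redundancy 0) is scheduled before a stripe that has lost only one block (remaining redundancy 2), so the stripes one failure away from data loss are repaired first.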
