Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST'08)

Avoiding the Disk Bottleneck in the Data Domain Deduplication File System



Abstract

Disk-based deduplication storage has emerged as the new-generation storage system for enterprise data protection to replace tape libraries. Deduplication removes redundant data segments to compress data into a highly compact form and makes it economical to store backups on disk instead of tape. A crucial requirement for enterprise data protection is high throughput, typically over 100 MB/sec, which enables backups to complete quickly. A significant challenge is to identify and eliminate duplicate data segments at this rate on a low-cost system that cannot afford enough RAM to store an index of the stored segments and may be forced to access an on-disk index for every input segment. This paper describes three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck. These techniques include: (1) the Summary Vector, a compact in-memory data structure for identifying new segments; (2) Stream-Informed Segment Layout, a data layout method to improve on-disk locality for sequentially accessed segments; and (3) Locality Preserved Caching, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios. Together, they can remove 99% of the disk accesses for deduplication of real world workloads. These techniques enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/sec for multi-stream throughput.
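The abstract describes the Summary Vector only as a compact in-memory structure for ruling out segments that have never been stored. A Bloom filter is the natural fit for that role; the sketch below is an illustrative assumption, not the paper's implementation, and the sizing parameters and hash construction are invented for the example.

```python
import hashlib

class SummaryVector:
    """Bloom-filter sketch of a Summary Vector (parameters are illustrative)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive k bit positions from the segment fingerprint by salting
        # a hash with the probe index.
        for i in range(self.num_hashes):
            h = hashlib.sha256(i.to_bytes(4, "big") + fingerprint).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, fingerprint: bytes):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint: bytes) -> bool:
        # False => the segment is definitely new, so the costly on-disk
        # index lookup can be skipped entirely.
        # True  => possibly a duplicate; fall through to the cache/index.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))
```

Because the filter admits no false negatives, a `False` answer safely avoids any disk access for new segments, which is where most of the 99% disk-access reduction for fresh data comes from.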
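Locality Preserved Caching, per the abstract, keeps the fingerprints of segments that were laid out together so that one duplicate hit prefetches its neighbors. A minimal sketch under assumed names (the container map, group size, and eviction policy here are all hypothetical): on a cache miss that resolves through the on-disk index, the whole container's fingerprint group is inserted, not just the single fingerprint.

```python
from collections import OrderedDict

# Hypothetical on-disk state: container id -> fingerprints of segments
# written together under a stream-informed layout.
CONTAINERS = {
    0: [b"fp-a", b"fp-b", b"fp-c"],
    1: [b"fp-d", b"fp-e"],
}
FP_TO_CONTAINER = {fp: cid for cid, fps in CONTAINERS.items() for fp in fps}

class LocalityPreservedCache:
    """Caches fingerprints in container-sized groups with LRU eviction."""

    def __init__(self, max_groups=2):
        self.max_groups = max_groups
        self.groups = OrderedDict()  # container id -> set of fingerprints
        self.disk_lookups = 0        # counts trips to the on-disk index

    def is_duplicate(self, fp: bytes) -> bool:
        for cid, fps in self.groups.items():
            if fp in fps:
                self.groups.move_to_end(cid)  # LRU touch on hit
                return True
        # Miss: consult the on-disk index (the expensive path).
        self.disk_lookups += 1
        cid = FP_TO_CONTAINER.get(fp)
        if cid is None:
            return False  # genuinely new segment
        # Insert the entire container's fingerprint group, preserving
        # the locality of the duplicate's neighbors.
        self.groups[cid] = set(CONTAINERS[cid])
        if len(self.groups) > self.max_groups:
            self.groups.popitem(last=False)  # evict least-recently-used group
        return True
```

Because backup streams repeat long runs of segments in the same order they were first written, one disk lookup amortizes over the whole group, which is how sequential duplicate runs achieve near-perfect cache hit ratios.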
