Data Compression Conference, 2009 (DCC '09)

Block Size Optimization in Deduplication Systems

Abstract

Data deduplication is a popular dictionary-based compression method in storage archiving and backup. Deduplication efficiency improves with smaller chunk sizes; however, files become highly fragmented, requiring many disk accesses during reconstruction, or increased chattiness in a client-server architecture. Within the sequence of chunks that an object (file) is decomposed into, sub-sequences of adjacent chunks tend to repeat. We exploit this insight to optimize chunk sizes by joining repeated sub-sequences of small chunks into new super chunks, under the constraint of achieving practically the same matching performance. We employ suffix arrays to find these repeating sub-sequences and to determine a new encoding that covers the original sequence. With super chunks we significantly reduce fragmentation, improving reconstruction time and the overall deduplication ratio by lowering the amount of metadata. As a result, fewer chunks are used to represent a file, reducing the number of disk accesses needed to reconstruct the file and requiring fewer entries in the chunk dictionary and fewer hashes to encode a file. To encode a sequence of chunks we prove two facts: (1) any sub-sequence that repeats is part of some super chunk (supermaximal in Figure 1), so the dictionary contains just supermaximals and non-repeats; and (2) maximals are the only repeats not covered (overlapped) by the containing supermaximals, so once we discover the maximals with the suffix array and encode them, there is no need to maintain auxiliary data structures such as bit masks to guarantee coverage (encoding) of the entire object. Our experimental evaluation (Figure 2) yields a reduction in fragmentation between 89% and 97% and a reduction in dictionary metadata (number of entries) between 80% and 97%, without increasing the total amount of storage required for unique chunks.
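
To make the re-encoding step concrete, below is a minimal Python sketch, not the authors' implementation: it builds a naive suffix array over a file's chunk-fingerprint sequence, collects repeated runs of adjacent chunks via the longest-common-prefix (LCP) array, and greedily covers the sequence with the longest repeated run at each position, rather than the supermaximal-repeat encoding proved in the paper. The function names (suffix_array, lcp_array, find_repeats, encode_with_super_chunks) and the toy chunk IDs are illustrative assumptions.

def suffix_array(seq):
    # Naive construction: sort suffix start positions by suffix content.
    # O(n^2 log n) comparisons; fine for a short illustration.
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def lcp_array(seq, sa):
    # lcp[k] = length of the common prefix of the suffixes at sa[k-1] and sa[k].
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        i, j, l = sa[k - 1], sa[k], 0
        while i + l < len(seq) and j + l < len(seq) and seq[i + l] == seq[j + l]:
            l += 1
        lcp[k] = l
    return lcp

def find_repeats(seq, min_len=2):
    # Repeated runs of adjacent chunks (candidate super chunks): any prefix shared
    # by two neighboring suffixes in the suffix array occurs at least twice.
    sa = suffix_array(seq)
    lcp = lcp_array(seq, sa)
    repeats = set()
    for k in range(1, len(sa)):
        for l in range(min_len, lcp[k] + 1):
            repeats.add(tuple(seq[sa[k]:sa[k] + l]))
    return repeats

def encode_with_super_chunks(seq, repeats):
    # Greedy left-to-right cover: at each position emit the longest repeated run,
    # otherwise the single chunk.
    encoding, i = [], 0
    while i < len(seq):
        best = 1
        for l in range(len(seq) - i, 1, -1):
            if tuple(seq[i:i + l]) in repeats:
                best = l
                break
        encoding.append(tuple(seq[i:i + best]))
        i += best
    return encoding

if __name__ == "__main__":
    # Chunk fingerprints of one file; the run (2, 3, 4) repeats and becomes a super chunk.
    file_chunks = [1, 2, 3, 4, 5, 2, 3, 4, 6, 2, 3, 4]
    enc = encode_with_super_chunks(file_chunks, find_repeats(file_chunks))
    print(enc)
    print(len(enc), "entries instead of", len(file_chunks))

On this toy input the 12-chunk file is encoded with 6 dictionary references. The greedy cover only approximates the coverage guarantee that the paper obtains directly from supermaximal and maximal repeats, which is what removes the need for auxiliary structures such as bit masks.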