Data Compression Conference, 2009 (DCC '09)

Block Size Optimization in Deduplication Systems

Abstract

Data deduplication is a popular dictionary-based compression method in storage archiving and backup. Deduplication efficiency improves with smaller chunk sizes; however, files become highly fragmented, requiring many disk accesses during reconstruction, or increased chattiness in a client-server architecture. Within the sequence of chunks that an object (file) is decomposed into, sub-sequences of adjacent chunks tend to repeat. We exploit this insight to optimize chunk sizes by joining repeated sub-sequences of small chunks into new super chunks, under the constraint of achieving practically the same matching performance. We employ suffix arrays to find these repeating sub-sequences and to determine a new encoding that covers the original sequence. With super chunks we significantly reduce fragmentation, improving reconstruction time and the overall deduplication ratio by lowering the amount of metadata. As a result, fewer chunks are used to represent a file, reducing the number of disk accesses needed to reconstruct the file and requiring fewer entries in the chunk dictionary and fewer hashes to encode a file. To encode a sequence of chunks we prove two facts: (1) any sub-sequence that repeats is part of some super chunk (supermaximal in Figure 1), so the dictionary contains just supermaximals and non-repeats; and (2) maximals are the only repeats not covered (overlapped) by the containing supermaximals, so once we discover the maximals with the suffix array and encode them, there is no need to maintain auxiliary data structures such as bit masks to guarantee coverage (encoding) of the entire object. Our experimental evaluation (Figure 2) yields a reduction in fragmentation between 89% and 97% and a reduction in dictionary metadata (number of entries) between 80% and 97%, without increasing the total amount of storage required for unique chunks.
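
To make the re-encoding step concrete, below is a minimal Python sketch, not the authors' implementation: it builds a naive suffix array over a file's chunk-fingerprint sequence, collects repeated runs of adjacent chunks via the longest-common-prefix (LCP) array, and greedily covers the sequence with the longest repeated run at each position, rather than the supermaximal-repeat encoding proved in the paper. The function names (suffix_array, lcp_array, find_repeats, encode_with_super_chunks) and the toy chunk IDs are illustrative assumptions.

def suffix_array(seq):
    # Naive construction: sort suffix start positions by suffix content.
    # O(n^2 log n) comparisons; fine for a short illustration.
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def lcp_array(seq, sa):
    # lcp[k] = length of the common prefix of the suffixes at sa[k-1] and sa[k].
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        i, j, l = sa[k - 1], sa[k], 0
        while i + l < len(seq) and j + l < len(seq) and seq[i + l] == seq[j + l]:
            l += 1
        lcp[k] = l
    return lcp

def find_repeats(seq, min_len=2):
    # Repeated runs of adjacent chunks (candidate super chunks): any prefix shared
    # by two neighboring suffixes in the suffix array occurs at least twice.
    sa = suffix_array(seq)
    lcp = lcp_array(seq, sa)
    repeats = set()
    for k in range(1, len(sa)):
        for l in range(min_len, lcp[k] + 1):
            repeats.add(tuple(seq[sa[k]:sa[k] + l]))
    return repeats

def encode_with_super_chunks(seq, repeats):
    # Greedy left-to-right cover: at each position emit the longest repeated run,
    # otherwise the single chunk.
    encoding, i = [], 0
    while i < len(seq):
        best = 1
        for l in range(len(seq) - i, 1, -1):
            if tuple(seq[i:i + l]) in repeats:
                best = l
                break
        encoding.append(tuple(seq[i:i + best]))
        i += best
    return encoding

if __name__ == "__main__":
    # Chunk fingerprints of one file; the run (2, 3, 4) repeats and becomes a super chunk.
    file_chunks = [1, 2, 3, 4, 5, 2, 3, 4, 6, 2, 3, 4]
    enc = encode_with_super_chunks(file_chunks, find_repeats(file_chunks))
    print(enc)
    print(len(enc), "entries instead of", len(file_chunks))

On this toy input the 12-chunk file is encoded with 6 dictionary references. The greedy cover only approximates the coverage guarantee that the paper obtains directly from supermaximal and maximal repeats, which is what removes the need for auxiliary structures such as bit masks.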