Future Generation Computer Systems

Accelerating content-defined-chunking based data deduplication by exploiting parallelism



Abstract

Data deduplication, a data reduction technique that efficiently detects and eliminates redundant data chunks and files, has been widely applied in large-scale storage systems. Most existing deduplication-based storage systems employ content-defined chunking (CDC) and secure-hash-based fingerprinting (e.g., SHA1) to remove redundant data at the chunk level (e.g., 4 KB/8 KB chunks), both of which are extremely compute-intensive and thus time-consuming for storage systems. Therefore, we present P-Dedupe, a pipelined and parallelized data deduplication system that accelerates deduplication by dividing the process into four stages (i.e., chunking, fingerprinting, indexing, and writing), pipelining these four stages over chunks and files (the processing data units for deduplication), and then parallelizing the CDC and secure-hash-based fingerprinting stages to further alleviate the computation bottleneck. More importantly, to efficiently parallelize CDC under both maximum and minimum chunk-size requirements, and inspired by the MapReduce model, we first split the data stream into several segments (i.e., "Map"), where each segment runs CDC in parallel on an independent thread, and then re-chunk and join the boundaries of these segments (i.e., "Reduce") to ensure the chunking effectiveness of parallelized CDC. Experimental results with eight datasets on a quad-core Intel i7 processor suggest that P-Dedupe accelerates deduplication throughput nearly linearly by exploiting parallelism in the CDC-based deduplication process, at the cost of only a 0.02% decrease in the deduplication ratio. Our work contributes to big data science by ensuring that all files go through the deduplication process quickly and thoroughly, and that the same file is processed and analyzed only once rather than multiple times.
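
To make the Map/Reduce-style parallel CDC described in the abstract concrete, below is a minimal Python sketch, not the authors' implementation: it splits the input into segments, chunks them in parallel ("Map"), and then joins and re-chunks the chunks on either side of each segment boundary ("Reduce"). The Gear-style rolling hash, the chunk-size parameters, the use of worker processes rather than threads, and names such as chunk_bytes and parallel_cdc are illustrative assumptions; only the segment-split/boundary-re-chunk structure and the SHA1 fingerprinting step follow the abstract.

```python
# Minimal sketch of parallelized content-defined chunking (CDC) in the spirit
# of P-Dedupe's Map/Reduce-style segmentation. Hash function, parameters, and
# all names here are assumptions for illustration, not taken from the paper.
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

MIN_CHUNK = 2 * 1024       # minimum chunk size (2 KB)
MAX_CHUNK = 64 * 1024      # maximum chunk size (64 KB)
AVG_MASK = 0x1FFF          # cut when the low 13 hash bits are 0 (~8 KB average)

# 256 pseudo-random 64-bit values for a Gear-style rolling hash.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]


def chunk_bytes(data: bytes) -> list[bytes]:
    """Content-defined chunking of one segment, honoring min/max chunk sizes."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & AVG_MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])   # content-defined cut point
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])            # trailing partial chunk
    return chunks


def parallel_cdc(data: bytes, workers: int = 4) -> list[bytes]:
    """'Map': chunk fixed-size segments in parallel.
    'Reduce': join and re-chunk each segment boundary, because a segment's
    last chunk is cut by the segment end rather than by content."""
    if not data:
        return []
    seg_size = max(len(data) // workers, MAX_CHUNK)
    segments = [data[i:i + seg_size] for i in range(0, len(data), seg_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        per_segment = list(pool.map(chunk_bytes, segments))

    merged = per_segment[0]
    for nxt in per_segment[1:]:
        boundary = merged.pop() + nxt[0]       # join the two boundary chunks
        merged.extend(chunk_bytes(boundary))   # re-chunk the joined region
        merged.extend(nxt[1:])                 # keep the segment's other chunks
    return merged


if __name__ == "__main__":
    blob = os.urandom(1 << 20)                 # 1 MiB of synthetic data
    chunks = parallel_cdc(blob)
    fingerprints = [hashlib.sha1(c).hexdigest() for c in chunks]  # fingerprinting stage
    print(len(chunks), "chunks,", len(set(fingerprints)), "unique fingerprints")
```

The Reduce step matters because each segment's last chunk is cut at the segment end rather than at a content-defined boundary; joining it with the next segment's first chunk and re-chunking that region restores content-defined cut points, which is how parallelized CDC keeps its deduplication ratio close to that of sequential CDC.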
