Future Generation Computer Systems

Accelerating content-defined-chunking based data deduplication by exploiting parallelism



Abstract

Data deduplication, a data reduction technique that efficiently detects and eliminates redundant data chunks and files, has been widely applied in large-scale storage systems. Most existing deduplication-based storage systems employ content-defined chunking (CDC) and secure-hash-based fingerprinting (e.g., SHA1) to remove redundant data at the chunk level (e.g., 4 KB/8 KB chunks), both of which are extremely compute-intensive and thus time-consuming for storage systems. Therefore, we present P-Dedupe, a pipelined and parallelized data deduplication system that accelerates deduplication by dividing the process into four stages (i.e., chunking, fingerprinting, indexing, and writing), pipelining these four stages over chunks and files (the processing data units for deduplication), and then parallelizing the CDC and secure-hash-based fingerprinting stages to further alleviate the computation bottleneck. More importantly, to efficiently parallelize CDC under both maximum and minimum chunk-size requirements, and inspired by the MapReduce model, we first split the data stream into several segments (i.e., "Map"), where each segment runs CDC in parallel on an independent thread, and then re-chunk and join the boundaries of these segments (i.e., "Reduce") to ensure the chunking effectiveness of parallelized CDC. Experimental results with eight datasets on a quad-core Intel i7 processor suggest that P-Dedupe accelerates deduplication throughput nearly linearly by exploiting parallelism in the CDC-based deduplication process, at the cost of only a 0.02% decrease in the deduplication ratio. Our work contributes to big data science by ensuring that all files go through the deduplication process quickly and thoroughly, and that the same file is processed and analyzed only once rather than multiple times.
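
To make the Map/Reduce-style parallel CDC described in the abstract concrete, below is a minimal Python sketch, not the authors' implementation: it splits the input into segments, chunks them in parallel ("Map"), and then joins and re-chunks the chunks on either side of each segment boundary ("Reduce"). The Gear-style rolling hash, the chunk-size parameters, the use of worker processes rather than threads, and names such as chunk_bytes and parallel_cdc are illustrative assumptions; only the segment-split/boundary-re-chunk structure and the SHA1 fingerprinting step follow the abstract.

```python
# Minimal sketch of parallelized content-defined chunking (CDC) in the spirit
# of P-Dedupe's Map/Reduce-style segmentation. Hash function, parameters, and
# all names here are assumptions for illustration, not taken from the paper.
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

MIN_CHUNK = 2 * 1024       # minimum chunk size (2 KB)
MAX_CHUNK = 64 * 1024      # maximum chunk size (64 KB)
AVG_MASK = 0x1FFF          # cut when the low 13 hash bits are 0 (~8 KB average)

# 256 pseudo-random 64-bit values for a Gear-style rolling hash.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]


def chunk_bytes(data: bytes) -> list[bytes]:
    """Content-defined chunking of one segment, honoring min/max chunk sizes."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & AVG_MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])   # content-defined cut point
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])            # trailing partial chunk
    return chunks


def parallel_cdc(data: bytes, workers: int = 4) -> list[bytes]:
    """'Map': chunk fixed-size segments in parallel.
    'Reduce': join and re-chunk each segment boundary, because a segment's
    last chunk is cut by the segment end rather than by content."""
    if not data:
        return []
    seg_size = max(len(data) // workers, MAX_CHUNK)
    segments = [data[i:i + seg_size] for i in range(0, len(data), seg_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        per_segment = list(pool.map(chunk_bytes, segments))

    merged = per_segment[0]
    for nxt in per_segment[1:]:
        boundary = merged.pop() + nxt[0]       # join the two boundary chunks
        merged.extend(chunk_bytes(boundary))   # re-chunk the joined region
        merged.extend(nxt[1:])                 # keep the segment's other chunks
    return merged


if __name__ == "__main__":
    blob = os.urandom(1 << 20)                 # 1 MiB of synthetic data
    chunks = parallel_cdc(blob)
    fingerprints = [hashlib.sha1(c).hexdigest() for c in chunks]  # fingerprinting stage
    print(len(chunks), "chunks,", len(set(fingerprints)), "unique fingerprints")
```

The Reduce step matters because each segment's last chunk is cut at the segment end rather than at a content-defined boundary; joining it with the next segment's first chunk and re-chunking that region restores content-defined cut points, which is how parallelized CDC keeps its deduplication ratio close to that of sequential CDC.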
