Future Generation Computer Systems

A new content-defined chunking algorithm for data deduplication in cloud storage


Abstract

Chunking is the process of splitting a file into smaller pieces called chunks. In applications such as remote data compression, data synchronization, and data deduplication, chunking is important because it determines the duplicate-detection performance of the system. Content-defined chunking (CDC) is a method that splits files into variable-length chunks, where the cut points are defined by internal features of the file content. Unlike fixed-length chunks, variable-length chunks are more resistant to byte shifting, which increases the probability of finding duplicate chunks within a file and between files. However, CDC algorithms require additional computation to find the cut points, which can be computationally expensive for some applications. In our previous work (Widodo et al., 2016), the hash-based CDC algorithm used in the system took more processing time than any other process in the deduplication system. This paper proposes a high-throughput hash-less chunking method called Rapid Asymmetric Maximum (RAM). Instead of using hashes, RAM uses byte values to declare the cut points. The algorithm uses a fixed-size window and a variable-size window to find a maximum-valued byte, which becomes the cut point. The maximum-valued byte is included in the chunk and located at the chunk boundary. This configuration allows RAM to perform fewer comparisons while retaining the CDC property. We compared RAM with existing hash-based and hash-less deduplication systems. The experimental results show that the proposed algorithm achieves higher throughput and more bytes saved per second than other chunking algorithms.
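The abstract only describes the chunking idea at a high level. The following minimal Python sketch illustrates a RAM-style hash-less chunker as the abstract describes it: scan a fixed-size window for its maximum byte value, then grow a variable-size window until a byte at least as large is found, and take that byte as the cut point included in the chunk. The window size `fixed_window`, the `>=` comparison convention, and the function name are assumptions for illustration, not details taken from the paper.

```python
def ram_chunk_boundaries(data: bytes, fixed_window: int = 256) -> list[int]:
    """Sketch of a RAM-style hash-less content-defined chunker.

    For each chunk: find the maximum byte value in a fixed-size window,
    then extend a variable-size window until a byte >= that maximum is
    seen; that byte is the cut point and is included in the chunk.
    """
    boundaries = []
    start = 0
    n = len(data)
    while start < n:
        window_end = min(start + fixed_window, n)
        max_byte = max(data[start:window_end])   # maximum in the fixed window
        cut = window_end
        # Variable-size window: grow until a byte >= the recorded maximum.
        while cut < n and data[cut] < max_byte:
            cut += 1
        end = min(cut + 1, n)                    # cut-point byte stays in this chunk
        boundaries.append(end)
        start = end
    return boundaries


if __name__ == "__main__":
    data = bytes(range(256)) * 8                 # toy input
    cuts = ram_chunk_boundaries(data, fixed_window=64)
    chunks = [data[i:j] for i, j in zip([0] + cuts[:-1], cuts)]
    assert b"".join(chunks) == data              # chunking is lossless
```

Because the cut points depend only on byte values (no rolling hash), each boundary costs a single comparison per scanned byte, which is the source of the throughput advantage the abstract claims over hash-based CDC.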
