Future Generation Computer Systems

A new content-defined chunking algorithm for data deduplication in cloud storage


Abstract

Chunking is the process of splitting a file into smaller pieces called chunks. In applications such as remote data compression, data synchronization, and data deduplication, chunking is important because it determines the duplicate-detection performance of the system. Content-defined chunking (CDC) is a method that splits files into variable-length chunks, where the cut points are defined by internal features of the file content. Unlike fixed-length chunks, variable-length chunks are more resistant to byte shifting, which increases the probability of finding duplicate chunks within a file and between files. However, CDC algorithms require additional computation to find the cut points, which can be computationally expensive for some applications. In our previous work (Widodo et al., 2016), the hash-based CDC algorithm used in the system took more processing time than any other process in the deduplication system. This paper proposes a high-throughput hash-less chunking method called Rapid Asymmetric Maximum (RAM). Instead of using hashes, RAM uses byte values to declare the cut points. The algorithm uses a fixed-size window and a variable-size window to find a maximum-valued byte, which becomes the cut point. The maximum-valued byte is included in the chunk and located at the chunk boundary. This configuration allows RAM to perform fewer comparisons while retaining the CDC property. We compared RAM with existing hash-based and hash-less deduplication systems. The experimental results show that the proposed algorithm achieves higher throughput and more bytes saved per second than other chunking algorithms.
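The abstract only describes the chunking idea at a high level. The following minimal Python sketch illustrates a RAM-style hash-less chunker as the abstract describes it: scan a fixed-size window for its maximum byte value, then grow a variable-size window until a byte at least as large is found, and take that byte as the cut point included in the chunk. The window size `fixed_window`, the `>=` comparison convention, and the function name are assumptions for illustration, not details taken from the paper.

```python
def ram_chunk_boundaries(data: bytes, fixed_window: int = 256) -> list[int]:
    """Sketch of a RAM-style hash-less content-defined chunker.

    For each chunk: find the maximum byte value in a fixed-size window,
    then extend a variable-size window until a byte >= that maximum is
    seen; that byte is the cut point and is included in the chunk.
    """
    boundaries = []
    start = 0
    n = len(data)
    while start < n:
        window_end = min(start + fixed_window, n)
        max_byte = max(data[start:window_end])   # maximum in the fixed window
        cut = window_end
        # Variable-size window: grow until a byte >= the recorded maximum.
        while cut < n and data[cut] < max_byte:
            cut += 1
        end = min(cut + 1, n)                    # cut-point byte stays in this chunk
        boundaries.append(end)
        start = end
    return boundaries


if __name__ == "__main__":
    data = bytes(range(256)) * 8                 # toy input
    cuts = ram_chunk_boundaries(data, fixed_window=64)
    chunks = [data[i:j] for i, j in zip([0] + cuts[:-1], cuts)]
    assert b"".join(chunks) == data              # chunking is lossless
```

Because the cut points depend only on byte values (no rolling hash), each boundary costs a single comparison per scanned byte, which is the source of the throughput advantage the abstract claims over hash-based CDC.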
