
A Fast Asymmetric Extremum Content Defined Chunking Algorithm for Data Deduplication in Backup Storage Systems



Abstract

Chunk-level deduplication plays an important role in backup storage systems. Existing Content-Defined Chunking (CDC) algorithms, while robust in finding suitable chunk boundaries, face the key challenges of (1) low chunking throughput that renders the chunking stage a serious deduplication performance bottleneck, (2) large chunk size variance that decreases deduplication efficiency, and (3) being unable to find proper chunk boundaries in low-entropy strings and thus failing to deduplicate these strings. To address these challenges, this paper proposes a new CDC algorithm called the Asymmetric Extremum (AE) algorithm. The main idea behind AE is based on the observation that, when dealing with the boundary-shifting problem, the extreme value in an asymmetric local range is not likely to be replaced by a new extreme value. As a result, AE achieves higher chunking throughput and smaller chunk size variance than existing CDC algorithms, and is able to find proper chunk boundaries in low-entropy strings. The experimental results based on real-world datasets show that AE improves the throughput performance of the state-of-the-art CDC algorithms by more than 2.3×, which is fast enough to remove the chunking-throughput performance bottleneck of deduplication, and accelerates the system throughput by more than 50 percent, while achieving comparable deduplication efficiency.
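
The abstract describes the algorithm only at a high level. As an illustration, here is a minimal Python sketch of an asymmetric-extremum cut-point search built from that description: a variable-length left region (from the chunk start to the current maximum) paired with a fixed-size right window, with the cut placed at the right edge of that window. The function names, the byte-wise maximum comparison, and the `window` parameter are assumptions of this sketch, not details confirmed by this record.

    def ae_chunk_boundary(data: bytes, window: int) -> int:
        """Return the length of the first chunk in `data`.

        A byte is treated as the extreme point when none of the `window`
        bytes to its right exceeds it; the cut point is then placed at the
        right edge of that fixed-size window. The left region, from the
        chunk start to the extreme point, is variable-length, which is
        what makes the two windows asymmetric.
        """
        if not data:
            return 0
        max_value, max_pos = data[0], 0
        for i in range(1, len(data)):
            if data[i] <= max_value:
                # `window` bytes have passed without a larger value: cut here
                if i == max_pos + window:
                    return i + 1
            else:
                # a new extreme value restarts the fixed right window
                max_value, max_pos = data[i], i
        return len(data)  # no qualifying extremum: emit the remainder

    def ae_chunks(data: bytes, window: int):
        """Split `data` into consecutive chunks using the boundary search."""
        start = 0
        while start < len(data):
            end = start + ae_chunk_boundary(data[start:], window)
            yield data[start:end]
            start = end

Two properties of this sketch line up with the abstract's claims: each byte costs only one comparison and a possible maximum update, with no rolling-hash computation, which is the intuition behind the throughput advantage; and in a stream of identical bytes the first byte is never displaced as the maximum, so a cut still fires once `window` further bytes pass, which is consistent with finding boundaries in low-entropy strings.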

Bibliographic Details

  • Source
    IEEE Transactions on Computers | 2017, Issue 2 | pp. 199-211 | 13 pages
  • Author Affiliations

    Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China;

    Department of Computer Science and Engineering, University of Texas at Arlington, 640 ERB, 500 UTA Blvd, Arlington, TX;

  • Original Format: PDF
  • Language: English
  • Keywords

    Throughput; Optimization; Power capacitors; Approximation algorithms; Redundancy; Robustness; Acceleration;


