Metadata-related overhead can significantly impact the performance of data deduplication systems, including the real duplication elimination ratio and the deduplication throughput. The amount of metadata produced is mainly determined by the chunking mechanism for the input data stream. In this paper, we propose a metadata harnessing deduplication (MHD) algorithm utilizing a duplication-distribution-based hysteresis re-chunking strategy. MHD harnesses the metadata by dynamically merging multiple non-duplicate chunks into one big chunk represented by one hash value while dividing big chunks straddling duplicate and non-duplicate data regions into small chunks represented with multiple hashes. Experimental results show that the proposed algorithm achieves a lower metadata overhead and a higher deduplication throughput for a given duplication elimination ratio, as compared with other state-of-the-art algorithms such as the Bimodal, Sub Chunk and Sparse Indexing algorithms.
展开▼