IEEE International Conference on Parallel and Distributed Systems

CACH-Dedup: Content Aware Clustered and Hierarchical Deduplication



Abstract

Distributed deduplication overcomes, to some extent, the index-lookup disk bottleneck by dividing deduplication tasks among many nodes. However, selecting these nodes is an important challenge because a poor choice can incur high communication cost and the storage-node island effect. Moreover, intelligent data routing is required to exploit the peculiar nature of data from different applications, which share an insignificant amount of content. In this paper, we explore CACH-Dedup, a content-aware clustered and hierarchical deduplication system, which exploits the negligibly small amount of content shared among chunks from different file types to create groups of files and storage nodes without loss of deduplication effectiveness. It uses hierarchical deduplication to reduce the size of fingerprint indexes at the global level, where only files and large segments are deduplicated. It also exploits locality, first through the large segments deduplicated at the global level and second by routing sets of consecutive files together to one storage node. Furthermore, it exploits similarity through per-stream similarity Bloom filters used for stateful routing, achieving a duplicate-elimination rate on par with single-node deduplication at minimal computation and communication cost. CACH-Dedup is evaluated using a prototype deployed on a Windows Server environment distributed over four separate machines. It is shown to achieve duplicate-elimination effectiveness on par with a single-node deduplication system, with minimal communication overhead and acceptable deduplication throughput.
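The similarity-based stateful routing described in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: it assumes each storage node summarizes the chunk fingerprints it holds in a Bloom filter, and an incoming stream is routed to the node whose filter matches the most of the stream's fingerprints. All names here (`BloomFilter`, `route_stream`) and the parameter choices are hypothetical.

```python
import hashlib

class BloomFilter:
    """Compact set summary: membership tests may give false positives,
    never false negatives. Sizes here are illustrative assumptions."""

    def __init__(self, size=1 << 16, hashes=4):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size // 8)

    def _positions(self, item):
        # Derive `hashes` independent bit positions from one SHA-256 digest family.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def contains(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def route_stream(fingerprints, node_filters):
    """Stateful routing sketch: estimate each node's shared content with the
    incoming stream by counting fingerprint hits in its Bloom filter, and
    return the index of the best-matching node."""
    def score(bf):
        return sum(bf.contains(fp) for fp in fingerprints)
    return max(range(len(node_filters)), key=lambda i: score(node_filters[i]))
```

Because the filters are small fixed-size bitmaps, this comparison avoids shipping full fingerprint indexes between nodes, which is consistent with the abstract's claim of minimal communication cost for stateful routing.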
