Discrete Applied Mathematics

Similarity based deduplication with small data chunks


Abstract

Large backup and restore systems may have a petabyte or more of data in their repository. Such systems are often compressed by means of deduplication techniques, which partition the input text into chunks and store recurring chunks only once. One approach is to use hashing methods to store a fingerprint for each data chunk, detecting identical chunks with a very low probability of collision. As an alternative, it has been suggested to use similarity- rather than identity-based searches, which allows the definition of much larger chunks. This implies that the data structure needed to store the fingerprints is much smaller, so such a system may be more scalable than systems built on the first approach.
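As a minimal sketch of the identity-based scheme the abstract contrasts with, the following toy example partitions input into fixed-size chunks, fingerprints each chunk with SHA-256, and stores recurring chunks only once. The chunk size, function names, and the use of a plain dictionary as the chunk store are all illustrative assumptions, not the paper's method (the paper's similarity-based approach would instead match much larger chunks approximately rather than by exact fingerprint equality):

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed chunk size; real systems often use content-defined chunking


def deduplicate(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks and store each distinct chunk
    once, keyed by its SHA-256 fingerprint. Returns the ordered list of
    fingerprints ("recipe") needed to reconstruct the input."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)  # a recurring chunk is stored only once
        recipe.append(fp)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Reassemble the original data from its fingerprint recipe."""
    return b"".join(store[fp] for fp in recipe)
```

For a repetitive input, the store holds far fewer bytes than the input itself, but the index must keep one fingerprint per (small) chunk; enlarging the chunks, as the similarity-based alternative permits, shrinks that index at the cost of exact-match detection.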
