IEEE Transactions on Computers

GenoDedup: Similarity-Based Deduplication and Delta-Encoding for Genome Sequencing Data

Abstract

The vast datasets produced in human genomics must be efficiently stored, transferred, and processed while prioritizing storage space and restore performance. Balancing these two properties is challenging with traditional data compression techniques: specialized algorithms for compressing sequencing data favor storage space, while large genome repositories widely resort to generic compressors (e.g., GZIP) to benefit from restore performance. Notably, human genomes are approximately 99.9 percent similar in DNA sequence, which creates an excellent opportunity for deduplication and its benefits: leveraging inter-file similarity and achieving higher read performance. However, identity-based deduplication fails to provide a satisfactory reduction in the storage requirements of genomes. In this article, we balance space savings and restore performance by proposing GenoDedup, the first method that integrates efficient similarity-based deduplication and specialized delta-encoding for genome sequencing data. Our solution currently achieves 67.8 percent of the reduction gains of SPRING (i.e., the best specialized tool in this metric) and restores data 1.62x faster than SeqDB (i.e., the fastest competitor). Additionally, GenoDedup restores data 9.96x faster than SPRING and compresses files 2.05x more than SeqDB.
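To illustrate the contrast the abstract draws between identity-based and similarity-based deduplication, here is a minimal Python sketch. It is not GenoDedup's actual algorithm, which the abstract does not describe. It assumes equal-length sequences, a brute-force search over a small in-memory index, and a substitution-only delta based on Hamming distance. All function names (`identity_lookup`, `hamming_delta`, `encode`, `decode`) are hypothetical.

```python
from typing import List, Optional, Tuple

# A delta is (index of the chosen base entry, [(position, replacement character), ...]).
Delta = Tuple[int, List[Tuple[int, str]]]


def identity_lookup(read: str, index: List[str]) -> Optional[int]:
    """Identity-based deduplication: succeeds only on an exact match."""
    return index.index(read) if read in index else None


def hamming_delta(base: str, read: str) -> List[Tuple[int, str]]:
    """List the positions where `read` differs from `base`, with the new character."""
    if len(base) != len(read):
        raise ValueError("this toy sketch only handles equal-length sequences")
    return [(i, r) for i, (b, r) in enumerate(zip(base, read)) if b != r]


def encode(read: str, index: List[str]) -> Delta:
    """Similarity-based deduplication: pick the closest base entry (brute force)
    and keep only the differences from it."""
    candidates = [
        (i, hamming_delta(base, read))
        for i, base in enumerate(index)
        if len(base) == len(read)
    ]
    if not candidates:
        raise ValueError("no base entry of matching length")
    # The candidate with the fewest substitutions is the most similar one.
    return min(candidates, key=lambda c: len(c[1]))


def decode(delta: Delta, index: List[str]) -> str:
    """Restore the original sequence by applying the substitutions to the base."""
    base_id, edits = delta
    chars = list(index[base_id])
    for pos, ch in edits:
        chars[pos] = ch
    return "".join(chars)
```

With the index `["ACGTACGT", "TTTTCCCC"]` and the read `"ACGTACCT"`, `identity_lookup` finds no match because a single base differs, while `encode` stores a pointer to the first entry plus one substitution, and `decode` restores the read exactly. A real system would need a scalable similarity index and support for insertions and deletions, neither of which this sketch attempts.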
