Published in: 2018 IEEE International Congress on Big Data

GDedup: Distributed File System Level Deduplication for Genomic Big Data

Abstract

Over the last few years, the cost of sequencing has dropped and the amount of generated genomic sequence data has skyrocketed. As a consequence, genomic sequence data have become more expensive to store than to generate, and the storage needs for these data keep growing accordingly. Various compression algorithms have been used to meet these storage needs. Nevertheless, typical compression ratios for genomic data range between 3 and 10. In this paper, we propose GDedup, a deduplication storage system for genomic data, to improve data storage capacity and efficiency in distributed file systems without compromising I/O performance. GDedup can be developed by modifying existing storage system environments such as the Hadoop Distributed File System. By taking advantage of deduplication technology, we can better manage the underlying redundancy in genomic sequence data and reduce the space needed to store these files in the file system, thus allowing for more capacity per volume. We present a study of the relation between the number of mutations of different types in genomic data (such as point mutations, substitutions, and inversions) and their effect on the deduplication ratio for a data set of vertebrate genomes in FASTA format. The experimental results show that the deduplication ratios are superior to the actual compression ratios for both I/O patterns (file read-decompress and write-compress), highlighting the potential for this technology to be adapted to improve storage management of genomic data.
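
To make the idea of a deduplication ratio concrete, here is a minimal sketch of chunk-level deduplication applied to two nearly identical sequences. It is not GDedup's implementation: the fixed 64-byte chunks, the SHA-256 fingerprints, the synthetic random sequence, and every name in the code (`dedup_ratio`, `substitute`, `reference`, `variant`) are assumptions made for illustration only.

```python
import hashlib
import random


def chunks(data: bytes, size: int):
    """Yield consecutive fixed-size chunks of data (the last one may be shorter)."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def dedup_ratio(files, chunk_size: int = 64) -> float:
    """Return logical bytes divided by bytes kept after chunk-level deduplication.

    Each chunk is identified by its SHA-256 digest; a chunk whose digest
    has already been seen is not stored a second time.
    """
    store = {}    # digest -> length of the single stored copy
    logical = 0   # total bytes written by clients
    for data in files:
        for chunk in chunks(data, chunk_size):
            logical += len(chunk)
            store.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    stored = sum(store.values())
    return logical / stored if stored else 1.0


def substitute(seq: bytes, pos: int) -> bytes:
    """Return a copy of seq with the base at pos replaced by a different base."""
    new_base = b"C" if seq[pos:pos + 1] == b"A" else b"A"
    return seq[:pos] + new_base + seq[pos + 1:]


# Hypothetical data: a random 6400-base "reference" and a variant that
# differs from it by a single substitution.
rng = random.Random(0)
reference = bytes(rng.choice(b"ACGT") for _ in range(64 * 100))
variant = substitute(reference, 1000)

ratio_identical = dedup_ratio([reference, reference])
ratio_variant = dedup_ratio([reference, variant])
print(f"two identical copies: {ratio_identical:.3f}")
print(f"reference + variant:  {ratio_variant:.3f}")
```

In this sketch a substitution changes only the chunk that contains it, so the variant adds a single new chunk to the store. An insertion or deletion would shift every later chunk boundary under fixed-size chunking and would be expected to reduce the ratio much more, which is one reason the type of mutation matters for deduplication.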