GDedup: Distributed File System Level Deduplication for Genomic Big Data


Abstract

In recent years, the cost of sequencing has dropped and the amount of generated genomic sequence data has skyrocketed. As a consequence, genomic sequence data have become more expensive to store than to generate, and storage needs are growing accordingly. To meet these needs, different compression algorithms have been used; nevertheless, typical compression ratios for genomic data range between 3 and 10. In this paper, we propose GDedup, a deduplication storage system for genomics data, to improve storage capacity and efficiency in distributed file systems without compromising I/O performance. GDedup can be developed by modifying existing storage system environments such as the Hadoop Distributed File System. By taking advantage of deduplication technology, we can better manage the underlying redundancy in genomic sequence data and reduce the space needed to store these files, thus allowing for more capacity per volume. We present a study of the relation between the amount of different types of mutations in genomic data (such as point mutations, substitutions, and inversions) and their effect on the deduplication ratio for a data set of vertebrate genomes in FASTA format. The experimental results show that the deduplication ratio values are superior to the actual compression ratio values for both I/O patterns (file read-decompress and write-compress), highlighting the potential for this technology to be adapted to improve storage management of genomics data.
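To make the notion of a deduplication ratio concrete, the following is a minimal sketch and not the GDedup implementation. It assumes fixed-size chunking with SHA-256 fingerprints, which the abstract does not specify; the chunk size, the helper names (`dedup_ratio`, `point_substitution`, `inversion`), and the synthetic random sequence are all hypothetical choices made for illustration. It compares the ratio obtained when a sequence is stored next to an identical copy, a copy with one point substitution, and a copy with one inverted segment.

```python
import hashlib
import random

CHUNK = 64  # hypothetical chunk size in bytes, chosen only for this sketch

def dedup_ratio(blobs, chunk_size=CHUNK):
    """Return logical bytes / bytes of unique chunks, using fixed-size chunks."""
    unique = {}   # fingerprint -> chunk length
    logical = 0
    for blob in blobs:
        for i in range(0, len(blob), chunk_size):
            chunk = blob[i:i + chunk_size]
            logical += len(chunk)
            # Store each distinct chunk only once, keyed by its SHA-256 digest.
            unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    stored = sum(unique.values())
    return logical / stored if stored else 1.0

def point_substitution(seq, pos):
    """Replace the base at pos with a different base."""
    new = b"A" if seq[pos:pos + 1] != b"A" else b"C"
    return seq[:pos] + new + seq[pos + 1:]

COMPLEMENT = bytes.maketrans(b"ACGT", b"TGCA")

def inversion(seq, start, end):
    """Replace seq[start:end] with its reverse complement (length unchanged)."""
    return seq[:start] + seq[start:end][::-1].translate(COMPLEMENT) + seq[end:]

# Synthetic stand-in for a FASTA sequence body (assumption: no header lines).
rng = random.Random(0)
base = bytes(rng.choices(b"ACGT", k=100 * CHUNK))

r_identical = dedup_ratio([base, base])
r_point = dedup_ratio([base, point_substitution(base, 1000)])
r_inversion = dedup_ratio([base, inversion(base, 1000, 1200)])
print(r_identical, r_point, r_inversion)
```

Under these assumptions, a point substitution invalidates only the chunk that contains it, while an inversion invalidates every chunk it overlaps, so the ratio falls as more of the sequence is altered. Edits that change sequence length would additionally shift all later chunk boundaries under fixed-size chunking; content-defined chunking is a common way to limit that effect.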


Bibliographic record

Source
2018 IEEE International Congress on Big Data | 2018 | pp. 120-127 | 8 pages
Conference location: San Francisco (US)
Authors
Paul Bartus; Emmanuel Arzuaga
Original format: PDF
Language: English
Keywords
Genomics; Bioinformatics; Distributed databases; DNA; File systems; Indexes

