Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data

Dandi Qiao; Wai-Ki Yip; Christoph Lange

Abstract

Background: As Next-Generation Sequencing data becomes available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data, owing to its enormous size. This is, and will remain, a problem that researchers working on genetic data encounter every day. Some options exist for compressing and storing such data, such as general-purpose compression software and the PBAT/PLINK binary format. However, these currently available methods either do not offer sufficient compression rates or require a great amount of CPU time for decompression and loading every time the data is accessed.

Results: Here, we propose a novel and simple algorithm for storing such sequencing data. We show that the compression factor of the algorithm ranges from 16 to several hundred, which potentially allows SNP data of hundreds of gigabytes to be stored in hundreds of megabytes. We provide a C++ implementation of the algorithm, which supports direct loading and parallel loading of the compressed format without requiring extra time for decompression. By applying the algorithm to simulated and real datasets, we show that it gives a greater compression rate than the commonly used compression methods and that the data-loading process takes less time. In addition, the C++ library provides direct data-retrieval functions, which allow the compressed information to be easily accessed by other C++ programs.

Conclusions: The SpeedGene algorithm enables the storage and analysis of next-generation sequencing data in current hardware environments, making system upgrades unnecessary.
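To make the idea of a compact, directly readable genotype store concrete, the following is a minimal Python sketch of 2-bit genotype packing, the same general idea used by 2-bit binary genotype formats. It is not the SpeedGene encoding, which the abstract does not describe. The genotype codes (0, 1, 2 for minor-allele counts, 3 for missing), the function names, and the figure of 4 bytes per genotype in a text file are all assumptions made for illustration.

```python
# Illustrative sketch only: generic 2-bit genotype packing, NOT the SpeedGene format.
# Assumed genotype codes: 0, 1, 2 = copies of the minor allele, 3 = missing.

def pack_genotypes(genotypes):
    """Pack genotype codes (0..3) into bytes, four genotypes per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        # Genotype i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def get_genotype(packed, i):
    """Read genotype i directly from the packed bytes, without unpacking the rest."""
    return (packed[i // 4] >> (2 * (i % 4))) & 0b11


def compression_factor(n_genotypes, text_bytes_per_genotype=4):
    """Ratio of text size to packed size, assuming a fixed text cost per genotype."""
    packed_bytes = (n_genotypes + 3) // 4
    return n_genotypes * text_bytes_per_genotype / packed_bytes
```

Under the assumed text cost of 4 bytes per genotype, `compression_factor(1000)` evaluates to 16.0. This only shows how a factor of that order can arise from bit packing; it says nothing about how SpeedGene reaches the larger factors reported above. `get_genotype` shows the kind of direct access that avoids a separate decompression pass.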


Bibliographic details

Source: BMC Bioinformatics, 2012, Issue 1
Authors: Dandi Qiao; Wai-Ki Yip; Christoph Lange
Original format: PDF
Chinese Library Classification: Biological sciences

Similar literature

1. Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data [J]. Dandi Qiao, Wai-Ki Yip, Christoph Lange. BMC Bioinformatics. 2012, Issue 1.
2. Data structures and compression algorithms for high-throughput sequencing technologies [J]. Kenny Daily, Paul Rigor, Scott Christley. BMC Bioinformatics. 2010, Issue 1.
3. Efficient seismic response data storage and transmission using ARX model-based sensor data compression algorithm [J]. Yunfeng Zhang, Jian Li. Earthquake Engineering & Structural Dynamics. 2006, Issue 6.
4. A Log-Linear Graphical Model for inferring genetic networks from high-throughput sequencing data [C]. Allen Genevera I., Liu Zhandong. 2012 IEEE International Conference on Bioinformatics and Biomedicine. 2012.
5. Algorithms for Determining Differentially Expressed Genes and Chromosome Structures From High-Throughput Sequencing Data [D]. Yang, Yi-Wen. 2015.
6. Handling the data management needs of high-throughput sequencing data: SpeedGene a compression algorithm for the efficient storage of genetic data [O]. Dandi Qiao, Wai-Ki Yip, Christoph Lange. 2012.
7. Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data [O]. 2012.
