
TheSNPpit—A High Performance Database System for Managing Large Scale SNP Data


Abstract

The fast development of high-throughput genotyping has opened up new possibilities in genetics while at the same time producing considerable data handling issues. TheSNPpit is a database system for managing large amounts of multi-panel SNP genotype data from any genotyping platform. With an increasing rate of genotyping in areas like animal and plant breeding as well as human genetics, hundreds of thousands of individuals already need to be managed. While the common database design with one row per SNP can manage hundreds of samples, this approach becomes progressively slower as the size of the data sets increases, until it fails completely once tens or even hundreds of thousands of individuals need to be managed. TheSNPpit implements three ideas to accommodate such large-scale experiments: highly compressed vector storage in a relational database, set-based data manipulation, and a very fast export written in C, with Perl as the base for the framework and PostgreSQL as the database backend. Its novel subset system allows the creation of named subsets based on the filtering of SNPs (by major allele frequency, no-calls, and chromosomes) and on manually applied sample and SNP lists at negligible storage cost, thus avoiding the issue of proliferating file copies. The named subsets are exported for downstream analysis. PLINK ped and map files are processed as inputs and outputs. TheSNPpit allows management of different panel sizes in the same population of individuals when higher-density panels replace previous lower-density versions, as occurs in animal and plant breeding programs. A completely generalized procedure allows storage of phenotypes. TheSNPpit occupies only 2 bits for storing a single SNP, implying a capacity of 4 million SNPs per 1 MB of disk storage.
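The 2-bit figure can be illustrated with a minimal Python sketch of packing genotype calls four to a byte. The code table (`AA`, `AB`, `BB`, `--`) and the function names `pack` and `unpack` are assumptions made for illustration; the abstract does not describe the encoding TheSNPpit actually uses, and its own export code is written in C.

```python
# Hypothetical 2-bit genotype codes; the abstract does not state the real ones.
CODES = {"AA": 0, "AB": 1, "BB": 2, "--": 3}  # "--" stands for a no-call
DECODE = {value: key for key, value in CODES.items()}


def pack(genotypes):
    """Pack genotype calls into bytes, four calls per byte (2 bits each)."""
    out = bytearray((len(genotypes) + 3) // 4)
    for i, call in enumerate(genotypes):
        # Place each 2-bit code at bit offset 0, 2, 4 or 6 within its byte.
        out[i // 4] |= CODES[call] << (2 * (i % 4))
    return bytes(out)


def unpack(data, n_calls):
    """Recover the first n_calls genotype calls from a packed byte string."""
    return [
        DECODE[(data[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(n_calls)
    ]


calls = ["AA", "AB", "BB", "--", "AB"]
packed = pack(calls)
assert unpack(packed, len(calls)) == calls
print(len(packed))  # 2 bytes hold the 5 calls
```

With four calls per byte, 4 million SNPs occupy 1 million bytes, which matches the capacity quoted in the abstract.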
To investigate performance scaling, a database with more than 18.5 million samples was created, holding 3.4 trillion SNPs from 12 panels ranging from 1000 to 20 million SNPs and resulting in a database of 850 GB. Import and export performance scales linearly with the number of SNPs and is largely independent of panel and database size. Import speed is around 6 million SNPs/sec; export speed is between 60 and 120 million SNPs/sec. Because TheSNPpit is command-line based, imports and exports can easily be integrated into pipelines. TheSNPpit is available under the open-source GNU General Public License (GPL) Version 2.
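The figures quoted in the abstract can be cross-checked with a short Python sketch. It assumes decimal units (1 GB = 10^9 bytes), counts only the raw 2-bit genotype payload without any database overhead, and uses the hypothetical helper names `storage_bytes` and `transfer_seconds`.

```python
BITS_PER_SNP = 2  # storage cost per SNP stated in the abstract


def storage_bytes(n_snps):
    """Raw payload size in bytes for n_snps genotypes at 2 bits each."""
    return n_snps * BITS_PER_SNP // 8


def transfer_seconds(n_snps, snps_per_second):
    """Time to move n_snps at a constant rate (linear scaling assumed)."""
    return n_snps / snps_per_second


total_snps = 3_400_000_000_000  # 3.4 trillion SNPs in the test database
print(storage_bytes(total_snps) / 1e9)  # 850.0 (GB, decimal units)

# Rough import time for one sample on the largest (20 million SNP) panel
# at the reported import rate of about 6 million SNPs/sec.
print(transfer_seconds(20_000_000, 6_000_000))
```

Under these assumptions the raw payload of 3.4 trillion SNPs comes to 850 GB, in line with the reported database size.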

Bibliographic details

  • Journal: other
  • Year (volume), issue: -1(11),10
  • Year: -1
  • Pages: e0164043
  • Total pages: 18
  • Original format: PDF

  • Date indexed: 2022-08-21 11:11:15
