TheSNPpit—A High Performance Database System for Managing Large Scale SNP Data

Abstract

The rapid development of high-throughput genotyping has opened up new possibilities in genetics while at the same time creating considerable data handling problems. TheSNPpit is a database system for managing large amounts of multi-panel SNP genotype data from any genotyping platform. With genotyping rates increasing in areas such as animal and plant breeding and human genetics, hundreds of thousands of individuals already need to be managed. While the common database design with one row per SNP can manage hundreds of samples, this approach becomes progressively slower as the size of the data sets increases, until it fails completely once tens or even hundreds of thousands of individuals need to be managed. TheSNPpit implements three ideas to accommodate such large-scale experiments: highly compressed vector storage in a relational database, set-based data manipulation, and a very fast export written in C, with Perl as the base for the framework and PostgreSQL as the database backend. Its novel subset system allows the creation of named subsets based on SNP filtering (by major allele frequency, no-calls, and chromosomes) and on manually applied sample and SNP lists at negligible storage cost, thus avoiding the proliferation of file copies. The named subsets are exported for downstream analysis. PLINK ped and map files are supported as input and output. TheSNPpit allows different panel sizes to be managed in the same population of individuals when higher-density panels replace earlier lower-density versions, as occurs in animal and plant breeding programs. A completely generalized procedure allows storage of phenotypes. TheSNPpit occupies only 2 bits per stored SNP, implying a capacity of 4 million SNPs per 1 MB of disk storage.
To investigate performance scaling, a database with more than 18.5 million samples was created, holding 3.4 trillion SNPs from 12 panels ranging from 1,000 to 20 million SNPs, resulting in a database of 850 GB. Import and export performance scales linearly with the number of SNPs and is largely independent of panel and database size. Import speed is around 6 million SNPs/sec; export speed is between 60 and 120 million SNPs/sec. Because TheSNPpit is command-line based, imports and exports can easily be integrated into pipelines. TheSNPpit is available under the open-source GNU General Public License (GPL) Version 2.
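
The 2-bit storage figure in the abstract can be illustrated with a small sketch. The genotype codes, the function names, and the byte layout below are assumptions made for illustration only; the abstract does not describe TheSNPpit's actual on-disk encoding. The sketch packs four genotype calls into each byte and then checks the abstract's arithmetic: 4 million SNPs fit in 1 MB, and 3.4 trillion SNPs need roughly 850 GB.

```python
# Hypothetical 2-bit genotype packing. The code table and layout are
# illustrative assumptions, not TheSNPpit's documented format.
CODES = {"AA": 0, "AB": 1, "BB": 2, "--": 3}  # "--" stands for a no-call
CALLS = {code: call for call, code in CODES.items()}


def pack(genotypes):
    """Pack a list of genotype calls into bytes, four calls per byte."""
    out = bytearray((len(genotypes) + 3) // 4)
    for i, call in enumerate(genotypes):
        out[i // 4] |= CODES[call] << (2 * (i % 4))
    return bytes(out)


def unpack(packed, n):
    """Recover the first n genotype calls from a packed byte string."""
    return [CALLS[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]


def packed_size_bytes(n_snps):
    """Bytes needed to hold n_snps calls at 2 bits each."""
    return (n_snps + 3) // 4


sample = ["AA", "AB", "BB", "--", "AB"]
blob = pack(sample)
assert unpack(blob, len(sample)) == sample

print(len(blob))                                   # bytes used for 5 calls
print(packed_size_bytes(4_000_000))                # bytes for 4 million SNPs
print(packed_size_bytes(3_400_000_000_000) / 1e9)  # GB for 3.4 trillion SNPs
```

Under these assumptions the raw genotype payload alone accounts for about 850 GB, which matches the database size reported in the abstract; any overhead for indexes or metadata is not modelled here.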

Bibliographic details

Journal: other
Authors: Eildert Groeneveld; Helmut Lichtenberg
Volume (issue): 11 (10)
Pages: e0164043
Total pages: 18
Original format: PDF
Date indexed: 2022-08-21 11:11:15

Similar literature

1. Large-scale identification of human bone remains via SNP microarray analysis with reference SNP database [J]. Cho Sohee, Kim Moon-Young, Lee Ji Hyun, Forensic science international. Genetics. 2020, No. 1

2. Managing database server performance to meet QoS requirements in electronic commerce systems [J]. Patrick Martin, Wendy Powley, Hoi-Ying Li, International journal on digital libraries. 2002, No. 4

3. Development and implementation of a database system to manage a large-scale mouse ENU-mutagenesis program [J]. Hiroshi Masuya, Yuji Nakai, Hiromi Motegi, Mammalian genome: official journal of the International Mammalian Genome Society. 2004, No. 5

4. Data types managed database design for dynamic content: A database design for Personal Health Book system [C] . Alabbasi Seddiq, Ahmed Arif, Kaneko Kunihiko, IEEE Region 10 Conference . 2014

5. Performance tuning of large Oracle(RTM) database systems: A study of performance tuning concepts and strategies for large Oracle database systems on UNIX platforms. [D] . Dokka, Ramabhadra R. 1998

6. A High-Performance Database System for Managing Large Multi-resolution Medical Images [O] . Tahsin Kurc, Michael Beynon, Chialin Chang, 1999

7. TheSNPpit-A High Performance Database System for Managing Large Scale SNP Data. [O] . Eildert Groeneveld, Helmut Lichtenberg 2016
