Second-generation PLINK: rising to the challenge of larger and richer datasets


Abstract

Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format.

Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).

Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
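To make the phrase "bit-level parallelism" concrete, the following is a minimal illustrative sketch in Python and is not taken from PLINK's C/C++ source. It assumes genotypes are packed two bits each into a 64-bit word, using the codes 0 = homozygous first allele, 1 = missing, 2 = heterozygous, 3 = homozygous second allele. The encoding and the names `pack_genotypes`, `count_genotypes`, and `MASK_LO` are assumptions made for this example. The point is that a few mask and popcount operations tally up to 32 genotypes at once instead of inspecting them one by one.

```python
# Illustrative sketch only: names and the 2-bit encoding are assumptions,
# not PLINK's actual implementation.
MASK_LO = 0x5555555555555555  # selects the low bit of every 2-bit field


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack_genotypes(codes):
    """Pack up to 32 two-bit codes into one integer, first code in the lowest bits."""
    if len(codes) > 32:
        raise ValueError("at most 32 genotypes fit in one 64-bit word")
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 3) << (2 * i)
    return word


def count_genotypes(word: int, n: int):
    """Return (hom_first, het, hom_second, missing) for the n genotypes in word."""
    lo = word & MASK_LO          # low bit of each field
    hi = (word >> 1) & MASK_LO   # high bit of each field, shifted into low position
    hom_second = popcount(lo & hi)   # code 3: both bits set
    het = popcount(hi & ~lo)         # code 2: high bit only
    missing = popcount(lo & ~hi)     # code 1: low bit only
    # Unused padding fields are zero, so derive code 0 from n rather than counting it.
    hom_first = n - hom_second - het - missing
    return hom_first, het, hom_second, missing


codes = [0, 2, 3, 1, 2, 0, 3, 3]
word = pack_genotypes(codes)
print(count_genotypes(word, len(codes)))
```

In a compiled language the same masks would typically feed a hardware popcount instruction, so the cost per word stays roughly constant regardless of how many of the 32 slots are in use.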