Journal: PLoS Genetics

A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank

Abstract

The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension of the UK Biobank pose new challenges for applying the lasso, as many existing algorithms and their implementations do not scale to such large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports ℓ1-penalized linear models, logistic regression, and the Cox model, and also extends to the elastic net with an ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
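The abstract gives only the outline of BASIL: screen a batch of variables, fit with an existing lasso solver, iterate. The Python sketch below is one way to read that outline, not the snpnet implementation. It makes several assumptions: the function names (`lasso_cd`, `screened_lasso_path`) and parameters (`batch_size`, `kkt_tol`) are hypothetical; a plain coordinate-descent routine stands in for glmnet; the model is linear without an intercept; one λ is processed at a time; and the whole matrix is held in memory, whereas a real large-scale version would read only the screened columns from disk.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, beta0=None, max_sweeps=5000, tol=1e-10):
    """Stand-in lasso solver (assumption: any solver could be swapped in).

    Cyclic coordinate descent on (1/2n) * ||y - X b||^2 + lam * ||b||_1.
    """
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else np.array(beta0, dtype=float)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            old = beta[j]
            rho = X[:, j] @ resid / n + col_sq[j] * old
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != old:
                resid -= X[:, j] * (new - old)
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta


def screened_lasso_path(X, y, lambdas, batch_size=20, kkt_tol=1e-7):
    """Hypothetical screen / fit / check loop over a decreasing lambda path."""
    n, p = X.shape
    beta = np.zeros(p)
    working = np.zeros(p, dtype=bool)  # variables handed to the solver so far
    path = []
    for lam in sorted(lambdas, reverse=True):
        while True:
            # Fit: solve the lasso on the working subset only (warm start).
            idx = np.flatnonzero(working)
            beta_sub = lasso_cd(X[:, idx], y, lam, beta0=beta[idx])
            beta = np.zeros(p)
            beta[idx] = beta_sub
            # Check: KKT condition |x_j' r| / n <= lam for excluded variables.
            grad = np.abs(X.T @ (y - X @ beta)) / n
            violates = ~working & (grad > lam + kkt_tol)
            if not violates.any():
                break
            # Screen: add the excluded variables with the largest gradients.
            excluded = np.flatnonzero(~working)
            order = np.argsort(grad[excluded])[::-1]
            working[excluded[order[:batch_size]]] = True
        path.append(beta.copy())
    return path
```

Calling `screened_lasso_path(X, y, lambdas)` on a column-standardized `X` and a centered `y` returns one coefficient vector per λ, largest λ first. The KKT check is what makes the subset fit valid for the full problem: once no excluded variable violates it, the subset solution also solves the lasso over all variables, up to the solver tolerance.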