Journal: Statistical Analysis and Data Mining

A machine learning method for selection of genetic variants to increase prediction accuracy of type 2 diabetes mellitus using sequencing data


Abstract

Type 2 diabetes mellitus (T2DM) affects millions of people through its life-altering complications. Worldwide, 3.4 million people die of diabetes annually. Studying the effect of genetic polymorphism on T2DM has been plagued by the available sample size. A 2016 Nature Reviews article summarized that the accuracy of predicting future type 2 diabetes from genetic polymorphism is very low at the population level. Innumerable associations between genes, environmental factors, and type 2 diabetes remain to be discovered. This research presents a method to identify subtle effects of genetic variants using whole genome sequencing data and improve prediction accuracy of T2DM at the population level. To achieve this, a new feature selection procedure and a classifier are proposed. The method involves (a) first applying sparse principal component analysis to genotype data to obtain orthogonal features; (b) building a new classifier using single nucleotide polymorphism (SNP)-specific regularization parameters to reduce the false positive rate of feature selection; (c) verifying feature relevance through penalized logistic regression. After application to a dataset containing 625,597 SNPs and 23 environmental variables from each of 3326 humans, the method identified 271 genetic variants with subtle effects on T2DM prediction. These variants led to greatly improved prediction accuracy for new patients at the population level. The proposed method also has the advantage of computational efficiency, over 15 times faster than random forest and extreme gradient boosting (XGBoost) classifiers, and thus provides a promising tool for large-scale genome-wide association studies.
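
The abstract describes steps (b) and (c) only at a high level. As a rough illustration of what feature-specific regularization can look like, the sketch below assumes a weighted L1 (lasso-type) penalty in logistic regression, fitted by proximal gradient descent, where each feature gets its own penalty. The function names, penalty values, and toy genotype data are hypothetical and are not taken from the paper; this is not the authors' classifier.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_weighted_l1_logistic(X, y, penalties, step=0.1, n_iter=500):
    """Logistic regression in which feature j has its own L1 penalty.

    penalties[j] plays the role of a feature-specific regularization
    parameter (an assumption made for illustration). The intercept is
    left unpenalized.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = 0.0
    for _ in range(n_iter):
        grad = [0.0] * p
        grad0 = 0.0
        for xi, yi in zip(X, y):
            z = intercept + sum(b * v for b, v in zip(beta, xi))
            resid = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob. minus label
            grad0 += resid / n
            for j in range(p):
                grad[j] += resid * xi[j] / n
        intercept -= step * grad0
        # Gradient step followed by per-feature soft-thresholding.
        beta = [
            soft_threshold(beta[j] - step * grad[j], step * penalties[j])
            for j in range(p)
        ]
    return intercept, beta


# Toy genotypes coded 0/1/2 for three hypothetical SNPs (all 27 combinations).
X = [[g0, g1, g2] for g0 in (0, 1, 2) for g1 in (0, 1, 2) for g2 in (0, 1, 2)]
# The label depends mainly on SNP 0; rows with g0 == 1 get mixed labels.
y = [1 if g0 == 2 else 0 if g0 == 0 else (g1 + g2) % 2 for g0, g1, g2 in X]

# A large penalty on SNP 2 forces its coefficient to exactly zero.
penalties = [0.01, 0.01, 10.0]
intercept, beta = fit_weighted_l1_logistic(X, y, penalties)
print("intercept:", intercept)
print("coefficients:", beta)
```

In this toy setup, the heavily penalized third feature is dropped from the model, while the first feature, which drives the labels, keeps a positive coefficient. Raising the penalty on features suspected to be noise is one way such a scheme could reduce false positives in feature selection.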
