Source: Scientific Reports (U.S. National Institutes of Health literature collection)

Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data

Abstract

Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome-wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches prompted us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset, containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC), was re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistic. An assessment of the impact of quality control (QC), imputation and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. In contrast, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet, provided similar results, with a maximum AUC of 0.80. GBT methods such as XGBoost, LightGBM and CatBoost, together with dense NN with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected nearly all the genetic variants previously identified by GWAS among the best predictors, plus additional predictors with lower effects. The robustness and complementarity of the different methods were also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.
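For readers unfamiliar with the AUC statistic used above to compare the methods, the following is a minimal sketch of how it can be computed from case/control labels and predicted risk scores. It is not the authors' code: the function name `auc_score` and the toy inputs are assumptions made for illustration, and the rank-based (Mann-Whitney) formulation is one standard way to obtain the AUC.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: sequence of 0 (control) or 1 (case)
    scores: sequence of predicted risk scores (higher = more likely a case)
    Both names are hypothetical and chosen for this illustration only.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one case and one control")

    # Sum the ranks of the cases, giving tied scores their average rank.
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    # U counts the (case, control) pairs in which the case scores higher,
    # with ties counted as one half.
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Toy example: two controls and two cases.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to a random ranking of cases and controls and 1.0 to a perfect separation, which puts the maximum of 0.80 reported for the LR methods in context.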
