Using machine learning approaches and genomic data for fracture risk prediction in the US older men

Qing Wu

Abstract

Predicting an individual's fracture risk from genomic variants remains a challenge because of the complexity of the data. Genome-wide association studies (GWAS) have discovered numerous single nucleotide polymorphisms (SNPs) associated with bone density and fracture. However, how best to use these genetic variants to predict an individual's fracture risk remains unclear. Conventional statistical approaches lack the flexibility and adequacy needed to model complex genomic data. Thus, we aimed to 1) develop different machine learning (ML) models from genomic data and 2) identify the best ML model for fracture prediction. Genomic data from the Osteoporotic Fractures in Men cohort study (N=5,133) were analyzed. Genotype imputation was performed at the Sanger Imputation Server. A total of 1,103 fracture-associated SNPs were identified, and a corresponding weighted genetic risk score was derived for each man in the data. Conventional osteoporosis risk factors and the identified genomic variants were included for analysis and modeling. Data were normalized and split into a training set (80%) and a validation set (20%). For model training, the synthetic minority over-sampling technique was employed to account for the low fracture rate, and 10-fold cross-validation was employed for hyper-parameter optimization. In the testing set, the area under the ROC curve (AUC) and accuracy were used to assess model performance. Gradient boosting performed best among the four models in predicting fracture, with an AUC of 0.71 and an accuracy of 0.88. Random forest and neural network had AUCs of 0.70 and 0.69 and accuracies of 0.80 and 0.84, respectively. Logistic regression had the worst performance, with an AUC of 0.65 and an accuracy of 0.69. Each pairwise comparison between models was significant (p<0.0001). Thus, the ML algorithms performed better than logistic regression in fracture prediction, and gradient boosting performed best for prediction in these men.
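Two quantities named in the abstract, the weighted genetic risk score and the AUC, can be illustrated with a minimal sketch. The abstract does not give the study's formulas, so this assumes the common definitions: a score equal to the sum of each SNP's risk-allele dosage multiplied by a per-SNP weight, and an AUC equal to the fraction of (fracture, no-fracture) pairs that the score ranks correctly. The SNP identifiers, weights, dosages, and outcomes below are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence


def weighted_grs(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted genetic risk score: sum over SNPs of weight * risk-allele dosage.

    Assumes every SNP listed in `weights` has a dosage (0 to 2) in `dosages`;
    a missing SNP raises KeyError rather than being silently imputed.
    """
    return sum(weight * dosages[snp] for snp, weight in weights.items())


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-SNP weights (e.g. effect sizes); not values from the study.
weights = {"snp_a": 0.10, "snp_b": 0.05, "snp_c": 0.20}

# Hypothetical risk-allele dosages for four men.
cohort = [
    {"snp_a": 2, "snp_b": 1, "snp_c": 0},
    {"snp_a": 0, "snp_b": 0, "snp_c": 1},
    {"snp_a": 1, "snp_b": 2, "snp_c": 2},
    {"snp_a": 0, "snp_b": 1, "snp_c": 0},
]
fractured = [0, 1, 1, 0]  # hypothetical outcomes: 1 = fracture, 0 = no fracture

scores = [weighted_grs(man, weights) for man in cohort]
print("risk scores:", scores)
print("AUC of the score alone:", roc_auc(fractured, scores))
```

In the study the score is one input among many to the ML models rather than a classifier by itself; the pairwise AUC above is only meant to show what the reported metric measures.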

