Predicting an individual's fracture risk from genomic variants remains a challenge because of the complexity of the data. Genome-wide association studies (GWASs) have discovered numerous single nucleotide polymorphisms (SNPs) associated with bone density and fracture. However, how best to use these genetic variants to predict an individual's fracture risk remains unclear. Conventional statistical approaches lack the flexibility to model complex genomic data adequately. Our aims were therefore to (1) develop different machine learning (ML) models from genomic data and (2) identify the best ML model for fracture prediction. Genomic data from the Osteoporotic Fractures in Men cohort study (N = 5,133) were analyzed. Genotype imputation was performed at the Sanger Imputation Server. A total of 1,103 fracture-associated SNPs were identified, and a corresponding weighted genetic risk score was derived for each man in the data. Conventional osteoporosis risk factors and the identified genomic variants were included in the analysis and modeling. Data were normalized and split into a training set (80%) and a validation set (20%). For model training, the synthetic minority over-sampling technique was employed to account for the low fracture rate, and 10-fold cross-validation was employed for hyper-parameter optimization. In the testing set, the area under the ROC curve (AUC) and accuracy were used to assess model performance. Gradient boosting performed best among the four models in predicting fracture, with an AUC of 0.71 and an accuracy of 0.88. Random forest and neural network achieved AUCs of 0.70 and 0.69 and accuracies of 0.80 and 0.84, respectively. Logistic regression performed worst, with an AUC of 0.65 and an accuracy of 0.69. Each pairwise comparison between models was significant (p < 0.0001). Thus, the ML algorithms outperformed logistic regression in fracture prediction, and gradient boosting performed best in this cohort of men.
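As a rough illustration of the weighted genetic risk score mentioned above, the following minimal sketch sums each person's risk-allele dosages multiplied by a per-SNP weight. The abstract does not say how the weights were obtained or how dosages were coded, so the function name `weighted_grs`, the use of GWAS-style effect sizes as weights, the 0 to 2 dosage coding, and all numeric values are assumptions made purely for illustration, not the study's actual procedure.

```python
from typing import Sequence


def weighted_grs(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted genetic risk score: sum over SNPs of dosage * weight.

    Hypothetical helper; assumes dosages are risk-allele counts (0 to 2)
    and weights are per-SNP effect sizes supplied by the caller.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Illustrative per-SNP weights (made-up values, not from the study).
weights = [0.5, 0.25, 1.0]

# Illustrative risk-allele dosages for two hypothetical individuals.
cohort = {
    "subject_a": [0, 1, 2],
    "subject_b": [2, 2, 0],
}

# One score per individual, keyed by subject identifier.
scores = {sid: weighted_grs(d, weights) for sid, d in cohort.items()}
print(scores)
```

In a workflow like the one described, a score of this kind would be one input feature alongside the conventional risk factors; with 1,103 SNPs the same sum would simply run over a longer vector.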