Comparing machine learning and logistic regression methods for predicting hypertension using a combination of gene expression and next-generation sequencing data

Elizabeth Held; Joshua Cape; Nathan Tintle

Abstract

Machine learning methods continue to show promise in the analysis of data from genetic association studies because of the high number of variables relative to the number of observations. However, few best practices exist for the application of these methods. We extend a recently proposed supervised machine learning approach for predicting disease risk from genotypes so that it can incorporate gene expression data and rare variants. We then apply 2 versions of the approach (radial and linear support vector machines) to simulated data from Genetic Analysis Workshop 19 and compare their performance to that of logistic regression. Performance did not differ radically across the 3 methods, although the linear support vector machine tended to show small gains in predictive ability relative to the radial support vector machine and logistic regression. Importantly, as the number of genes in the models increased, even when those genes contained causal rare variants, predictive ability showed a statistically significant decrease for both the radial support vector machine and logistic regression. The linear support vector machine was more robust to the inclusion of additional genes. Further work is needed to evaluate machine learning approaches on larger samples and to evaluate the relative improvement in model prediction from the incorporation of gene expression data.
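
The three-way comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the software, the evaluation metric, or the feature construction, so the use of scikit-learn, 5-fold cross-validated AUC, and a synthetic data set (standing in for the Genetic Analysis Workshop 19 data) are all assumptions made here for illustration. The function name `compare_models` and its parameters are likewise hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_models(n_samples=200, n_features=50, n_informative=5, seed=0):
    """Return the mean cross-validated AUC for each of the three classifiers.

    The data are synthetic (assumption): many features, few of them
    informative, loosely mimicking a high-dimensional genetic setting.
    """
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_informative,
        n_redundant=0,
        random_state=seed,
    )

    models = {
        "linear_svm": SVC(kernel="linear"),
        "radial_svm": SVC(kernel="rbf"),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }

    results = {}
    for name, clf in models.items():
        # Standardize features inside the pipeline so that scaling is
        # fitted on the training folds only.
        pipeline = make_pipeline(StandardScaler(), clf)
        scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
        results[name] = scores.mean()
    return results


results = compare_models()
for name, auc in results.items():
    print(f"{name}: mean cross-validated AUC = {auc:.3f}")
```

Raising `n_features` while holding `n_informative` fixed gives a rough way to explore how each classifier responds to additional predictors, although results on synthetic data should not be read as a replication of the study's findings.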

