Journal: Nature Protocols

Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data

Abstract

DNA methylation data-based precision cancer diagnostics is emerging as the state of the art for molecular tumor classification. Standards for choosing statistical methods with regard to well-calibrated probability estimates for these typically highly multiclass classification tasks are still lacking. To support this choice, we evaluated well-established machine learning (ML) classifiers including random forests (RFs), elastic net (ELNET), support vector machines (SVMs) and boosted trees in combination with post-processing algorithms and developed ML workflows that allow for unbiased class probability (CP) estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth's penalized LR. We compared these workflows on a recently published brain tumor 450k DNA methylation cohort of 2,801 samples with 91 diagnostic categories using a 5 x 5-fold nested cross-validation scheme and demonstrated their generalizability on external data from The Cancer Genome Atlas. ELNET was the top stand-alone classifier with the best calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. The protocols developed as a result of these comparisons provide valuable guidance on choosing ML workflows and their tuning to generate well-calibrated CP estimates for precision diagnostics using DNA methylation data. Computation times vary depending on the ML algorithm from <15 min to 5 d using multi-core desktop PCs. Detailed scripts in the open-source R language are freely available on GitHub, targeting users with intermediate experience in bioinformatics and statistics and using R with Bioconductor extensions.
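As a rough illustration of two ideas named in the abstract, the sketch below builds index splits for a 5 x 5-fold nested cross-validation and fits a binary Platt-style calibrator (a logistic regression on raw classifier scores). It is not the authors' code: their scripts are in R, whereas this is plain Python chosen for illustration. All function names (`k_fold_indices`, `nested_cv_splits`, `fit_platt`, `platt_predict`), the unstratified shuffling, the gradient-descent fitting, and the learning-rate settings are assumptions. The abstract does not say how the folds were drawn or how Platt scaling was extended to the 91 classes, so only the binary case is shown.

```python
import math
import random


def k_fold_indices(indices, k, seed):
    """Shuffle the indices and deal them into k nearly equal folds."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, k_outer=5, k_inner=5, seed=0):
    """Yield (outer, inner, inner_train, inner_test, outer_test) index lists.

    Assumption: plain random folds. A real multiclass workflow would
    probably stratify by class, which is omitted here for brevity.
    """
    outer_folds = k_fold_indices(range(n_samples), k_outer, seed)
    for o, outer_test in enumerate(outer_folds):
        outer_train = [
            idx for j, fold in enumerate(outer_folds) if j != o for idx in fold
        ]
        inner_folds = k_fold_indices(outer_train, k_inner, seed + o + 1)
        for i, inner_test in enumerate(inner_folds):
            inner_train = [
                idx for j, fold in enumerate(inner_folds) if j != i for idx in fold
            ]
            yield o, i, inner_train, inner_test, outer_test


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss.

    This is the logistic-regression form of Platt scaling for one binary
    (for example one-vs-rest) problem; labels must be 0 or 1.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_predict(score, a, b):
    """Map a raw classifier score to a calibrated probability."""
    return sigmoid(a * score + b)


if __name__ == "__main__":
    # 2,801 is the cohort size quoted in the abstract.
    splits = list(nested_cv_splits(2801))
    print(len(splits), "inner splits across the outer folds")

    # Toy scores and labels, invented purely for demonstration.
    a, b = fit_platt([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0], [0, 0, 1, 0, 1, 1])
    print("calibrated P(y=1 | score=1.5):", platt_predict(1.5, a, b))
```

In a nested scheme like this, the inner test folds would typically supply the out-of-fold scores on which the calibrator is trained, and the outer test fold is reserved for evaluating the calibrated probabilities. Whether the published workflows follow exactly this arrangement should be checked against the paper and its GitHub scripts.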