Journal of Clinical Bioinformatics

Splitting random forest (SRF) for determining compact sets of genes that distinguish between cancer subtypes


Abstract

Background: The identification of very small subsets of predictive variables is an important topic that has not often been considered in the literature. To discover highly predictive yet compact gene set classifiers from whole-genome expression data, a non-parametric, iterative algorithm, Splitting Random Forest (SRF), was developed to robustly identify genes that distinguish between molecular subtypes. The goal is to improve prediction accuracy while accounting for sparsity.

Results: The optimal SRF 50-run (SRF50) gene classifiers for glioblastoma (GB), breast cancer (BC) and ovarian cancer (OC) subtypes had overall prediction rates comparable to those from published datasets upon validation (80.1%-91.7%). The SRF50 sets outperformed other methods in compactness, needing 10–200 fold fewer genes than ANOVA or published gene sets to distinguish between the tested cancer subtypes. The SRF50 sets also achieved superior and robust overall and subtype prediction accuracies compared with a single random forest (RF) and with the Top 50 ANOVA results. In the SRF50 vs single RF comparison, accuracies were 80.1% vs 77.8% for GB, 84.0% vs 74.1% for BC, and 89.8% vs 88.9% for OC. In the SRF50 vs Top 50 ANOVA comparison, they were 80.1% vs 77.2% for GB, 84.0% vs 82.7% for BC, and 89.8% vs 87.0% for OC. There was significant overlap between the SRF50 and published gene sets, showing that SRF identifies the relevant subsets of important gene lists. In Ingenuity Pathway Analysis (IPA), the “hub” genes shared between the SRF50 and published gene sets were RB1, PIK3R1, PDGFBB and ERK1/2 for GB; ESR1, MYC, NFkB and ERK1/2 for BC; and Akt, FN1, NFkB, PDGFBB and ERK1/2 for OC.

Conclusions: The SRF approach is an effective driver of biomarker discovery research that reduces the number of genes needed for robust classification, dissects complex, high-dimensional “omic” data, and provides novel insights into the cellular mechanisms that define cancer subtypes.
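The accuracy comparisons quoted in the Results are easier to read as differences. The short Python sketch below only tabulates the percentages stated in the abstract and computes SRF50's lead over each baseline in percentage points. It does not reproduce the SRF algorithm, and the names `accuracy`, `srf50_gain` and `gains` are illustrative choices, not taken from the paper.

```python
# Overall prediction accuracies (%) as quoted in the abstract, per cancer type.
accuracy = {
    "GB": {"SRF50": 80.1, "single RF": 77.8, "Top 50 ANOVA": 77.2},
    "BC": {"SRF50": 84.0, "single RF": 74.1, "Top 50 ANOVA": 82.7},
    "OC": {"SRF50": 89.8, "single RF": 88.9, "Top 50 ANOVA": 87.0},
}


def srf50_gain(scores):
    """Return SRF50's lead over each baseline, in percentage points."""
    return {
        name: round(scores["SRF50"] - value, 1)  # round away float noise
        for name, value in scores.items()
        if name != "SRF50"
    }


# Hypothetical helper output: one dict of gains per cancer type.
gains = {cancer: srf50_gain(scores) for cancer, scores in accuracy.items()}

for cancer, gain in gains.items():
    print(cancer, gain)
```

On these figures the lead is positive in every comparison, largest against the single RF for BC and smallest against the single RF for OC.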
