Journal: BMC Bioinformatics

binomialRF: interpretable combinatoric efficiency of random forests to identify biomarker interactions



Abstract

In this era of data science-driven bioinformatics, machine learning research has focused on feature selection, as users want more interpretation and post-hoc analyses for biomarker detection. However, when a study has more features (i.e., transcripts) than samples (i.e., mice or human samples), biomarker detection faces major statistical challenges because traditional statistical techniques are underpowered in high dimensions. Second- and third-order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest (RF) classifiers are widely used due to their flexibility, powerful performance, ability to rank features, and robustness to the “P ≫ N” high-dimensional limitation that many matrix regression algorithms face. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.

In both simulations and validation studies using datasets from the TCGA and UCI repositories, binomialRF showed computational gains (5 to 300 times faster) while maintaining competitive variable precision and recall in identifying biomarkers’ main effects and interactions. In two clinical studies, the binomialRF algorithm prioritizes previously published relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as with their statistical interactions alone.

binomialRF extends previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers’ main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.
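The idea of treating feature selection in a random forest as a binomial hypothesis test can be illustrated with a small sketch. The abstract does not spell out the test statistic, so the version below rests on several assumptions: each feature's statistic is taken to be the number of trees that split on it at the root, the null probability is 1/P under the simplification that all P features are interchangeable noise, the trees are treated as independent (a plain binomial, not the correlated binomial the abstract describes), and a Bonferroni correction stands in for whatever multiple-testing adjustment the authors use. The function names and the toy counts are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def select_features(root_counts, n_trees, alpha=0.05):
    """Flag features chosen at the root split more often than chance.

    root_counts maps a feature name to the number of trees whose root
    split uses that feature (hypothetical input format).
    """
    n_features = len(root_counts)
    # Assumption: under the null every feature is equally likely to win
    # the root split, so the per-tree success probability is 1 / P.
    p_null = 1.0 / n_features
    p_values = {
        name: binomial_upper_tail(count, n_trees, p_null)
        for name, count in root_counts.items()
    }
    # Assumption: Bonferroni correction across the P tests.
    threshold = alpha / n_features
    chosen = sorted(name for name, pv in p_values.items() if pv <= threshold)
    return p_values, chosen


# Toy data (made up): 100 trees, 10 features, counts sum to 100.
toy_counts = {
    "gene_a": 40, "gene_b": 25,
    "gene_c": 5, "gene_d": 5, "gene_e": 5,
    "gene_f": 4, "gene_g": 4, "gene_h": 4, "gene_i": 4, "gene_j": 4,
}
p_values, selected = select_features(toy_counts, n_trees=100)
print(selected)
```

Because each feature needs only one count and one tail probability, the cost of this kind of test grows with the number of features rather than with repeated model refits, which is one plausible reading of the efficiency the abstract reports. Trees in a forest share training data, so an independent binomial of this kind would tend to overstate significance; that is the gap the correlated binomial in the abstract is meant to address.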
