Source: Briefings in Bioinformatics (NIH literature collection)

Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Abstract

In the Life Sciences, ‘omics’ data are increasingly generated by different high-throughput technologies. Often only the integration of these data uncovers biological insights that can be experimentally validated or mechanistically modelled; that is, sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques train a model on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited to the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example, within a class of cancer patients, certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF, as the algorithm implicitly takes them into account when it creates the classification model. This review details some RF properties that, to the best of our knowledge, are rarely or never used, and that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.
