Journal: Briefings in Bioinformatics

Data mining in the life sciences with random forest: A walk in the park or lost in the jungle?



Abstract

In the Life Sciences, 'omics' data is increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example, within a class of cancer patients, certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF, as they are implicitly taken into account by the algorithm during the creation of the classification model. This review details some RF properties that, to the best of our knowledge, are rarely or never used, and that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.
