IEEE International Conference on Data Science and Advanced Analytics

Random Forest Framework Customized to Handle Highly Correlated Variables: An Extensive Experimental Study Applied to Feature Selection in Genetic Data

Abstract

The random forest model is a popular framework for classification and regression. When the data contain high correlations, it may be beneficial to capture these dependencies through latent variables, so that the random forest framework can be used more effectively. In this paper, we present Sylva, the second proposal of a random forest with latent variables after T-Trees, derived from the seminal work of Botta and co-workers (Botta et al., 2008). Sylva is an innovative hybrid approach in which the dynamic generation of the latent variables used to learn the random forest is driven by an additional forest model, this time a forest of latent tree models. The latter forest model, a class of Bayesian networks devised in (Mourad et al., 2011), allows flexible modeling of the dependencies existing within the data. In the comprehensive study reported here, three variants of Sylva, instantiated by different clustering methods (CAST, DBSCAN, Louvain method), are compared to T-Trees using high-dimensional real-world datasets (161 datasets, each describing around 5,000 observations and between 5,700 and 39,000 variables) in the context of genetic association studies. We show that T-Trees and Sylva have comparably high predictive power (areas under the ROC curves), lying in the range [0.887, 0.961] for T-Trees and in the range [0.885, 0.979] over the three Sylva instantiations. Interestingly, T-Trees and Sylva are shown to differ significantly in their importance measure distributions: in Sylva, the importance measure distribution corresponding to top-ranked variables is significantly skewed towards higher values than in T-Trees, which meets the feature selection enhancement objective. This property holds true for the three instantiations of Sylva.
In addition, a thorough analysis of the number of top-ranked variables jointly identified by T-Trees and Sylva highlights the possibility of cross-validating the findings in order to build a prioritized list of features (e.g., to be further analyzed by biologists in the context of genetic association studies). Finally, we conclude that, on the 161 datasets analyzed, CAST or DBSCAN is recommended over the Louvain method to increase the probability that Sylva detects, among its top-ranked variables, top variables missed by T-Trees.
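The comparison of top-ranked variables between two methods can be illustrated with a minimal sketch. This is not the authors' procedure: the function names (`top_k`, `compare_top_k`), the variable names, and the importance scores below are all hypothetical and made up for illustration. The sketch only shows how one might split two importance rankings into variables found by both methods and variables found by only one of them.

```python
def top_k(importances, k):
    """Return the set of the k variable names with the highest importance."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])


def compare_top_k(imp_a, imp_b, k):
    """Split the top-k variables of two rankings into shared and method-specific sets."""
    top_a, top_b = top_k(imp_a, k), top_k(imp_b, k)
    return {
        "shared": top_a & top_b,   # candidates supported by both methods
        "only_a": top_a - top_b,   # top-ranked by the first method only
        "only_b": top_b - top_a,   # top-ranked by the second method only
    }


# Toy importance scores (hypothetical, not taken from the study).
ttrees_imp = {"snp1": 0.30, "snp2": 0.25, "snp3": 0.05, "snp4": 0.02}
sylva_imp = {"snp1": 0.60, "snp2": 0.03, "snp3": 0.01, "snp4": 0.40}

result = compare_top_k(ttrees_imp, sylva_imp, k=2)
print(result)
```

In this toy case the "shared" set would be the natural start of a prioritized list, while "only_b" corresponds to variables that the second method ranks highly but the first one misses.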