Random forests for feature selection in QSPR Models - an application for predicting standard enthalpy of formation of hydrocarbons

Ana L Teixeira; João P Leal; Andre O Falcao

Abstract

Background: A central task in developing quantitative structure-property relationship (QSPR) predictive models is identifying the subset of variables that represent the structure of a molecule and that predict a given property. Several automated feature selection methods exist, ranging from backward, forward or stepwise procedures to more elaborate methodologies such as evolutionary programming. The difficulty lies in selecting the smallest subset of descriptors that predicts a given property accurately, robustly and at low computational cost, since irrelevant or redundant features can impair generalization. This paper proposes an alternative selection method for QSPR regression problems that uses Random Forests to determine variable importance, and applies it to a manually curated dataset for predicting standard enthalpy of formation. The subsequent predictive models are trained with support vector machines, introducing the variables sequentially from a list ranked by variable importance.

Results: The model generalizes well even with a high-dimensional dataset and in the presence of highly correlated variables. The feature selection step yielded RMSE values 23% lower than those obtained without feature selection, while using only 6% of the total number of variables (89 of the original 1485). The proposed approach also compared favourably with other feature selection methods and with dimension reduction of the feature space. The predictive model was selected using a 10-fold cross-validation procedure and then validated with an independent set to assess its performance on new data. The results were similar to those obtained for the training set, supporting the robustness of the proposed approach.

Conclusions: The proposed methodology appears to improve the prediction of the standard enthalpy of formation of hydrocarbons using a limited set of molecular descriptors. Reducing the number of descriptors makes their calculation faster and more cost-effective, and provides a better understanding of the underlying relationship between the molecular structure represented by the descriptors and the property of interest.
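
The workflow summarized in the abstract (rank variables with a Random Forest, add them one at a time to a support vector machine, pick the subset by 10-fold cross-validation, then check on an independent set) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes scikit-learn, uses synthetic data from `make_regression` in place of the curated enthalpy dataset, and uses scikit-learn's impurity-based `feature_importances_` as a stand-in for whatever importance measure the paper used. The function names `rank_features`, `cv_rmse` and `select_by_ranking`, and all hyperparameters, are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def make_svr():
    """Scaled RBF support vector regressor (hyperparameters are arbitrary)."""
    return make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))


def rank_features(X, y, seed=0):
    """Return column indices sorted from most to least important."""
    forest = RandomForestRegressor(n_estimators=100, random_state=seed)
    forest.fit(X, y)
    # Impurity-based importance: an assumption, not necessarily the paper's measure.
    return np.argsort(forest.feature_importances_)[::-1]


def cv_rmse(X, y, seed=0):
    """Square root of the mean fold MSE over a 10-fold cross-validation."""
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)
    neg_mse = cross_val_score(make_svr(), X, y, cv=folds,
                              scoring="neg_mean_squared_error")
    return float(np.sqrt(-neg_mse.mean()))


def select_by_ranking(X, y, max_features=None, seed=0):
    """Add ranked features one at a time and keep the best-scoring prefix."""
    ranking = rank_features(X, y, seed)
    limit = len(ranking) if max_features is None else max_features
    scores = [cv_rmse(X[:, ranking[:k]], y, seed) for k in range(1, limit + 1)]
    best_k = int(np.argmin(scores)) + 1
    return ranking[:best_k], scores


# Synthetic stand-in data: 30 descriptors, only 5 of them informative.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

selected, scores = select_by_ranking(X_train, y_train, max_features=15)
rmse_all = cv_rmse(X_train, y_train)  # baseline: no feature selection

# Refit on the selected descriptors and evaluate on the held-out set.
final_model = make_svr().fit(X_train[:, selected], y_train)
residuals = final_model.predict(X_test[:, selected]) - y_test
test_rmse = float(np.sqrt(np.mean(residuals ** 2)))

print(f"selected {len(selected)} of {X.shape[1]} descriptors")
print(f"CV RMSE: selected={min(scores):.2f}, all={rmse_all:.2f}")
print(f"independent-set RMSE: {test_rmse:.2f}")
```

Comparing the cross-validated RMSE of the selected subset with the held-out RMSE mirrors the check described in the Results paragraph. The numbers printed for synthetic data say nothing about the figures reported in the paper.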

Bibliographic record

Source: Journal of Cheminformatics, 2013, issue s1
Original format: PDF
Classification: Chemistry