Journal of Cheminformatics

Random forests for feature selection in QSPR Models - an application for predicting standard enthalpy of formation of hydrocarbons

Abstract

Background: One of the main topics in the development of quantitative structure-property relationship (QSPR) predictive models is identifying the subset of variables that represent the structure of a molecule and that are predictors of a given property. Several automated feature selection methods exist, ranging from backward, forward or stepwise procedures to more elaborate methodologies such as evolutionary programming. The problem lies in selecting the minimum subset of descriptors that can predict a given property accurately, robustly and at low computational cost, since irrelevant or redundant features can cause poor generalization. This paper proposes an alternative selection method for QSPR regression problems that uses Random Forests to determine variable importance, and applies it to a manually curated dataset for predicting standard enthalpy of formation. The subsequent predictive models are trained with support vector machines, introducing the variables sequentially from a list ranked by variable importance.
Results: The model generalizes well even with a high-dimensional dataset and in the presence of highly correlated variables. The feature selection step yielded RMSE values 23% lower than without feature selection while using only 6% of the total number of variables (89 of the original 1485). The proposed approach also compared favourably with other feature selection methods and with dimension reduction of the feature space. The predictive model was selected using a 10-fold cross-validation procedure and then validated with an independent set to assess its performance on new data; the results were similar to those obtained for the training set, supporting the robustness of the proposed approach.
Conclusions: The proposed methodology appears to improve the prediction of standard enthalpy of formation of hydrocarbons using a limited set of molecular descriptors. Reducing the number of descriptors makes their calculation faster and more cost-effective, and provides a better understanding of the underlying relationship between the molecular structure represented by the descriptors and the property of interest.
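
The workflow summarized in the abstract (rank descriptors by Random Forest importance, introduce them one at a time into a support vector machine, pick the subset by 10-fold cross-validation, then check it on an independent set) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes scikit-learn, synthetic data from `make_regression` in place of the curated enthalpy dataset, impurity-based importances (the abstract does not say which importance measure was used), and arbitrary hyperparameters. The function names `rank_features`, `cv_rmse` and `select_by_ranked_introduction` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def make_svr():
    # Assumed model settings; the abstract does not give kernel or C.
    return make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))


def rank_features(X, y, n_trees=200, seed=0):
    """Return feature indices sorted by decreasing forest importance."""
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    # Impurity-based importance; an assumption, see the lead-in.
    return np.argsort(forest.feature_importances_)[::-1]


def cv_rmse(X, y, folds=10):
    """Mean RMSE of the SVR pipeline over k-fold cross-validation."""
    scores = cross_val_score(
        make_svr(), X, y, cv=folds, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()


def select_by_ranked_introduction(X, y, ranking, max_features=15):
    """Add features in ranked order and keep the prefix with the lowest CV RMSE."""
    best_k, best_rmse = 0, np.inf
    for k in range(1, max_features + 1):
        rmse = cv_rmse(X[:, ranking[:k]], y)
        if rmse < best_rmse:
            best_k, best_rmse = k, rmse
    return ranking[:best_k], best_rmse


# Synthetic stand-in for a descriptor matrix: 50 features, 5 informative.
X, y = make_regression(
    n_samples=200, n_features=50, n_informative=5, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

ranking = rank_features(X_train, y_train)
selected, best_rmse = select_by_ranked_introduction(X_train, y_train, ranking)

# Independent validation: refit on the training set, score on held-out data.
final_model = make_svr().fit(X_train[:, selected], y_train)
pred = final_model.predict(X_test[:, selected])
holdout_rmse = float(np.sqrt(np.mean((pred - y_test) ** 2)))

print(f"selected {len(selected)} of {X.shape[1]} features")
print(f"10-fold CV RMSE: {best_rmse:.2f}, hold-out RMSE: {holdout_rmse:.2f}")
```

In this sketch the ranking is computed once on the whole training set, so the cross-validated error of the chosen subset is slightly optimistic. The held-out set is what gives an unbiased check, which mirrors the independent validation step described in the abstract.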