
Nonlinear Multivariate Polynomial Ensembles in QSAR/QSPR



Abstract

In this study, which is part of Damir Nadramija's PhD thesis developed in collaboration with the Rudjer Bošković Institute, we demonstrate the use of ensembles of linear and nonlinear multivariate regression models, based on multivariate polynomials of the initial descriptors, in QSAR/QSPR modeling. The data sets, which varied considerably in the number of variables and the number of data points, had all been previously reported in the literature; molecular structures were either obtained from the authors of those publications or generated in our laboratories. All data sets were encoded as SMILES and converted to 3D structures (SD files) by the CORINA program (www2.chemie.uni-erlangen.de/software/corina/). All descriptors were computed by the program DRAGON 2.1 (http://www.disat.unimib.it/chm/). Linear ensembles were built from multiple linear regression (MLR) models. Nonlinear ensembles consisted of multivariate polynomials constructed as controlled subsets of terms selected from the linear descriptors, their two-fold cross-products and squares, and the cubes of single descriptors only. Ensemble responses were computed as the mean, median, or weighted value of the responses of all intrinsic models. The models and ensembles discussed in this paper were constructed with NQSAR, a Windows console application that is available upon request. The results show a clear advantage of nonlinear ensembles over their linear counterparts when data sets contain 4 to 5 times more points than model coefficients. On the other hand, linear ensembles, which in general exhibit higher robustness and stability, are better suited to small data sets with many variables, where they outperform nonlinear ensembles in predicting values for data points from an external data set. This can be explained by the fact that linear models are less affected by small variations than nonlinear models, while they benefit equally from the key ensemble features. Chief among these features is the inclusion of more variables, spread across the optimized variable subsets used by the ensemble's intrinsic models, each of which individually satisfies the aforementioned rule on over-fitting. The overall ensemble responses are more stable and robust, and have higher predictive power, than those of single models.
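The construction described in the abstract can be illustrated with a minimal sketch. This is not the NQSAR implementation. The function names, the synthetic data, and the random choice of term subsets are all assumptions made for illustration; the abstract says only that the subsets are "controlled" and "optimized" and does not state how they are selected or how ensemble weights are chosen. The sketch expands the descriptors into linear terms, two-fold cross-products, squares and single-descriptor cubes, caps the number of coefficients per model using a points-to-coefficients ratio, fits each intrinsic model by ordinary least squares, and combines the responses as a mean, median or weighted value.

```python
from itertools import combinations

import numpy as np


def polynomial_terms(X):
    """Expand descriptors into linear terms, two-fold cross-products,
    squares, and cubes of single descriptors (no mixed cubic terms)."""
    p = X.shape[1]
    cols = [X[:, j] for j in range(p)]
    names = [f"x{j}" for j in range(p)]
    for i, j in combinations(range(p), 2):
        cols.append(X[:, i] * X[:, j])
        names.append(f"x{i}*x{j}")
    for power in (2, 3):
        for j in range(p):
            cols.append(X[:, j] ** power)
            names.append(f"x{j}^{power}")
    return np.column_stack(cols), names


def max_coefficients(n_points, ratio=5):
    """Largest coefficient count (intercept included) that keeps at least
    `ratio` data points per coefficient."""
    return n_points // ratio


def fit_model(T, y, term_idx):
    """Ordinary least squares on the chosen term columns plus an intercept."""
    A = np.column_stack([np.ones(len(y)), T[:, term_idx]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return term_idx, coef


def predict_model(model, T):
    term_idx, coef = model
    return coef[0] + T[:, term_idx] @ coef[1:]


def ensemble_predict(models, T, how="mean", weights=None):
    """Combine intrinsic-model responses as a mean, median, or weighted mean."""
    preds = np.vstack([predict_model(m, T) for m in models])
    if how == "mean":
        return preds.mean(axis=0)
    if how == "median":
        return np.median(preds, axis=0)
    if how == "weighted":
        w = np.asarray(weights, dtype=float)
        return (w[:, None] * preds).sum(axis=0) / w.sum()
    raise ValueError(f"unknown aggregation: {how}")


# Synthetic data (hypothetical): 40 points, 3 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (1.0 + 2.0 * X[:, 0] - X[:, 1] * X[:, 2] + 0.5 * X[:, 0] ** 2
     + rng.normal(scale=0.1, size=40))

T, names = polynomial_terms(X)
k = max_coefficients(len(y), ratio=5) - 1  # reserve one coefficient for the intercept

# Assumption: random term subsets stand in for the optimized subsets of the paper.
models = [
    fit_model(T, y, rng.choice(T.shape[1], size=k, replace=False))
    for _ in range(25)
]

y_mean = ensemble_predict(models, T, how="mean")
y_median = ensemble_predict(models, T, how="median")
y_weighted = ensemble_predict(models, T, how="weighted", weights=np.ones(len(models)))
```

With 40 points and a ratio of 5, each intrinsic model is limited to 8 coefficients (7 polynomial terms plus an intercept), while the ensemble as a whole can draw on all 12 candidate terms. This is the sense in which an ensemble includes more variables than any single model that respects the over-fitting rule.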
