Journal of the Brazilian Chemical Society

Maximal Information Coefficient and Support Vector Regression Based Nonlinear Feature Selection and QSAR Modeling on Toxicity of Alcohol Compounds to Tadpoles of Rana temporaria



Abstract

Efficient evaluation of the biotoxicity of organic compounds is vital to resource utilization and environmental protection. In this study, the toxicity of 110 alcohol compounds to tadpoles of Rana temporaria is used as the dependent variable, and each compound is represented by 1388 physicochemical parameters (features) calculated by PCLIENT. A three-step feature selection pipeline is developed to refine the feature subset: first, 282 features that correlate significantly with biotoxicity are preliminarily selected via the maximal information coefficient (MIC); second, 138 descriptors that contribute positively to model performance are retained after a support vector regression (SVR) based backward elimination; third, 18 descriptors are finally selected via a forward selection process that integrates minimal redundancy maximal relevance (mRMR), MIC and SVR. For feature subsets with different numbers of variables, quantitative structure-activity relationship (QSAR) models are built using multiple linear regression (MLR), partial least squares regression (PLS) and SVR. The independent prediction evaluation index, Q², increases from -74.787, 0.824 and 0.868 to 0.892, 0.878 and 0.940 for the three regression models (MLR, PLS and SVR), respectively. The results suggest that nonlinear feature selection methods based on MIC and SVR can effectively eliminate irrelevant descriptors. SVR outperforms the classical statistical models in QSAR modeling of high-dimensional data containing nonlinear relationships between features. The methods proposed in this study have potential applications in QSAR research, for example in studies of the biotoxicity of compounds.
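The strongly negative starting value reported for the Q2 index (Q²) may look surprising, so a minimal sketch of how such an index is commonly computed may help. The abstract does not give the formula, so this assumes the usual form, one minus the ratio of the squared prediction errors to the squared deviations from a reference mean. The function name `q2`, the `reference_mean` parameter and the toy numbers are illustrative assumptions, not values or code from the study.

```python
from statistics import fmean


def q2(y_true, y_pred, reference_mean=None):
    """Predictive Q2 = 1 - sum((y - y_hat)^2) / sum((y - mean)^2).

    Assumed definition: the abstract does not state which variant is used.
    reference_mean defaults to the mean of y_true; some variants use the
    training-set mean instead, which can be passed in explicitly.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    if reference_mean is None:
        reference_mean = fmean(y_true)
    # Sum of squared prediction errors on the independent samples.
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Sum of squared deviations from the reference mean.
    total = sum((t - reference_mean) ** 2 for t in y_true)
    if total == 0:
        raise ValueError("Q2 is undefined when all observed values equal the mean")
    return 1.0 - press / total


# Toy data (hypothetical, for illustration only).
y_observed = [1.0, 2.0, 3.0, 4.0]
close_predictions = [1.1, 1.9, 3.2, 3.8]
wild_predictions = [10.0, -8.0, 15.0, -6.0]

q2_close = q2(y_observed, close_predictions)  # close to 1
q2_wild = q2(y_observed, wild_predictions)    # far below 0
```

Under this definition Q² has an upper bound of 1 but no lower bound: a model whose predictions are much worse than simply predicting the mean yields a large negative value. That is consistent with an unstable linear fit on many more descriptors than compounds, although the abstract itself does not state the cause.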
