Journal of Chemical Information and Modeling

Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: Focusing on applicability domain and overfitting by variable selection

Abstract

Estimating the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric that quantifies the similarity between the training set molecules and a test set compound for a given property in the context of a specific model. It can be expressed in many different ways, e.g., using the Tanimoto coefficient, leverage, or correlation in the space of models. In this paper we used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distance to model based on the standard deviation of predicted toxicity, calculated from an ensemble of models, afforded the best results. This distance also successfully discriminated molecules with small and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is determined mainly by the distribution of the training set data in the chemistry and activity spaces, not by the QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in a wrong estimate of its performance, and we suggest how this problem can be circumvented. The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European chemical Substances Information System), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at http://www.qspr.org.
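As an illustration of the ensemble-based distance to model mentioned in the abstract, the following minimal Python sketch computes the standard deviation of an ensemble's predictions for each compound and compares the mean absolute error of the low-distance and high-distance halves. The function names, the use of the population standard deviation, the two-bin split, and the synthetic data are all assumptions made for illustration; they are not taken from the paper, which evaluates the distances with mixtures of Gaussian distributions and statistical tests.

```python
import random
import statistics


def ensemble_std_distance(predictions):
    """Hypothetical distance to model for one compound:
    standard deviation of the predictions made by the ensemble members."""
    return statistics.pstdev(predictions)


def mean_abs_error_by_distance(distances, abs_errors, n_bins=2):
    """Sort compounds by distance to model, split them into n_bins groups
    of (nearly) equal size, and return the mean absolute error per group."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    size = len(order) // n_bins
    result = []
    for b in range(n_bins):
        start = b * size
        # The last bin takes any leftover compounds.
        end = (b + 1) * size if b < n_bins - 1 else len(order)
        result.append(statistics.mean(abs_errors[i] for i in order[start:end]))
    return result


# Synthetic illustration only: no real toxicity data are used here.
rng = random.Random(0)
n_compounds, n_models = 200, 10
distances, abs_errors = [], []
for _ in range(n_compounds):
    true_value = rng.uniform(-1.0, 3.0)   # made-up "true" toxicity
    noise = rng.uniform(0.1, 1.0)         # made-up per-compound difficulty
    preds = [true_value + rng.gauss(0.0, noise) for _ in range(n_models)]
    distances.append(ensemble_std_distance(preds))
    abs_errors.append(abs(statistics.mean(preds) - true_value))

low_mae, high_mae = mean_abs_error_by_distance(distances, abs_errors)
print(f"MAE, low-distance half:  {low_mae:.3f}")
print(f"MAE, high-distance half: {high_mae:.3f}")
```

In this toy setup the compounds on which the ensemble members disagree most also tend to have the larger errors, which is the kind of discrimination the abstract attributes to this distance.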