Source: U.S. National Institutes of Health literature, Molecules

Error Tolerance of Machine Learning Algorithms across Contemporary Biological Targets

Abstract

Machine learning continues to make significant advances in predicting properties relevant to drug development. However, the efficacy of machine learning in this area depends on highly accurate and abundant data. These two requirements, high accuracy and abundance, are often considered together; however, understanding how much dataset error contemporary machine learning algorithms can tolerate may indicate whether non-bench sources of data can be used to build useful machine learning models where experimental data are scarce. We took highly accurate data across six kinase types, one GPCR, one polymerase, a human protease, and HIV protease, and intentionally introduced error into varying proportions of the dataset for each target. Using these error-containing datasets, we explored how the retrospective accuracy of a Naïve Bayes Network, a Random Forest model, and a Probabilistic Neural Network model decayed as a function of error. We also explored whether a training dataset with an error profile resembling that produced by the Free Energy Perturbation method (FEP+) could generate machine learning models with useful retrospective capabilities. The Naïve Bayes Network algorithm tolerated categorical error well: on average, 39% error in the training set was required before it lost predictivity on the test set. The Random Forest also tolerated a significant degree of categorical error in the training set, with an average of 29% error required to lose predictivity. The Probabilistic Neural Network algorithm tolerated less categorical error, losing predictivity at an average of 20% error. Finally, we found that a Naïve Bayes Network and a Random Forest could both use datasets with an error profile resembling that of FEP+. This work demonstrates that computational methods with a known error distribution, such as FEP+, may be useful for generating machine learning models without relying on extensive and expensive in vitro-generated datasets.
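The abstract does not say how the categorical error was introduced. As a rough illustration only, the sketch below shows one common way to inject such error: flipping a fixed fraction of binary activity labels in a training set. The function name `flip_labels`, the 0/1 label encoding, and the use of a seeded random sample are all assumptions made for this example, not the authors' procedure.

```python
import random


def flip_labels(labels, error_fraction, seed=0):
    """Return a copy of binary (0/1) labels with a fixed fraction flipped.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 0.0 <= error_fraction <= 1.0:
        raise ValueError("error_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so a run can be reproduced
    n_flip = round(error_fraction * len(labels))
    # Choose distinct positions so exactly n_flip labels end up wrong.
    flip_idx = rng.sample(range(len(labels)), n_flip)
    noisy = list(labels)
    for i in flip_idx:
        noisy[i] = 1 - noisy[i]
    return noisy


if __name__ == "__main__":
    # Toy training labels: 50 actives (1) and 50 inactives (0).
    clean = [1] * 50 + [0] * 50
    for frac in (0.0, 0.1, 0.2, 0.3, 0.4):
        noisy = flip_labels(clean, frac, seed=42)
        changed = sum(a != b for a, b in zip(clean, noisy))
        print(f"error fraction {frac:.1f}: {changed} of {len(clean)} labels flipped")
```

In a study of this kind, each noisy label set would then be used to train a classifier, and accuracy on an unmodified test set would be recorded as the error fraction increases.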
