Journal of Computer-Aided Molecular Design

Effect of training data size and noise level on support vector machines virtual screening of genotoxic compounds from large compound libraries


Abstract

Various in vitro and in-silico methods have been used for drug genotoxicity tests, but they show limited genotoxicity (GT+) and non-genotoxicity (GT-) identification rates. New methods and combinatorial approaches have been explored for enhanced collective identification capability. The rates of in-silico methods may be further improved by significantly diversified training data enriched by the large number of recently reported GT+ and GT- compounds, but a major concern is the increased noise level arising from the high false-positive rates of in vitro data. In this work, we evaluated the effect of training data size and noise level on the performance of the support vector machine (SVM) method, which is known to tolerate high noise levels in training data. Two SVMs of different diversity/noise levels were developed and tested. H-SVM, trained on higher-diversity, higher-noise data (GT+ in any in vivo or in vitro test), outperforms L-SVM, trained on lower-noise, lower-diversity data (GT+ in in vivo or Ames test only). H-SVM trained on 4,763 GT+ compounds reported before 2008 and 8,232 GT- compounds excluding clinical trial drugs correctly identified 81.6% of the 38 GT+ compounds reported since 2008, predicted 83.1% of the 2,008 clinical trial drugs as GT-, and predicted 23.96% of 168K MDDR and 27.23% of 17.86M PubChem compounds as GT+. These are comparable to the 43.1-51.9% GT+ and 75-93% GT- rates of existing in-silico methods, the 58.8% GT+ and 79% GT- rates of the Ames method, and the estimated percentages of 23% in vivo and 31-33% in vitro GT+ compounds in the "universe of chemicals". There is a substantial level of agreement between the H-SVM and L-SVM predicted GT+ and GT- MDDR compounds and the predictions from TOPKAT. SVM showed good potential for identifying GT+ compounds from large compound libraries based on higher-diversity, higher-noise training data.
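The reported H-SVM identification rates can be related to whole-compound counts with a little arithmetic. The sketch below is illustrative only: the abstract gives percentages and set sizes, not the counts, so the counts are inferred by rounding and are an assumption, and the helper names `implied_count` and `rate_percent` are hypothetical.

```python
def implied_count(percent, total):
    """Nearest whole number of compounds consistent with a reported percentage.

    Assumption: the abstract does not state counts, so this is only an inference.
    """
    return round(percent / 100 * total)


def rate_percent(count, total):
    """Identification rate as a percentage of the test set."""
    return 100 * count / total


# GT+ compounds reported since 2008 (38 in total, 81.6% correctly identified)
gt_pos_hits = implied_count(81.6, 38)
# Clinical trial drugs (2,008 in total, 83.1% predicted as GT-)
gt_neg_hits = implied_count(83.1, 2008)

print(gt_pos_hits, round(rate_percent(gt_pos_hits, 38), 1))    # 31 81.6
print(gt_neg_hits, round(rate_percent(gt_neg_hits, 2008), 1))  # 1669 83.1

# Ranges quoted in the abstract for existing in-silico methods
insilico_gt_pos = (43.1, 51.9)
insilico_gt_neg = (75.0, 93.0)
print(81.6 > insilico_gt_pos[1])                          # GT+ rate is above the quoted range
print(insilico_gt_neg[0] <= 83.1 <= insilico_gt_neg[1])   # GT- rate is within the quoted range
```

Under this reading, roughly 31 of the 38 recent GT+ compounds and about 1,669 of the 2,008 clinical trial drugs were classified as expected. With only 38 GT+ test compounds, a single compound shifts the GT+ rate by about 2.6 percentage points.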
