Journal of Cheminformatics

Evaluating parameters for ligand-based modeling with random forest on sparse data sets

Abstract

Ligand-based predictive modeling is widely used to generate predictive models that aid decision making in, e.g., drug discovery projects. As data sets grow and modeling time must stay low, data sets need to be analyzed efficiently to support rapid and robust modeling. In this study we analyzed four data sets and studied the efficiency of machine learning methods on sparse data structures, using Morgan fingerprints of different radii and hash sizes, and compared them with the molecular signatures descriptor at different heights. We specifically evaluated the effect these parameters had on modeling time, predictive performance, and memory requirements using two implementations of random forest: Scikit-learn and FEST. We also compared with a support vector machine implementation. Our results showed that unhashed fingerprints yield significantly better accuracy than hashed fingerprints (p ≤ 0.05), with no pronounced deterioration in modeling time and memory usage. Furthermore, the fast execution and low memory usage of the FEST algorithm suggest that it is a good alternative for large, high-dimensional sparse data. Support vector machines and random forest performed equally well, but results indicate that the support vector machine was better at using the extra information from larger values of the Morgan fingerprint's radius.

Electronic supplementary material: The online version of this article (10.1186/s13321-018-0304-9) contains supplementary material, which is available to authorized users.
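The difference between hashed and unhashed fingerprints that the abstract refers to can be illustrated with a minimal sketch. This is not the authors' code: the integer identifiers below are hypothetical stand-ins for substructure identifiers, and folding by modulo is a simplifying assumption about how a fixed hash size is applied. The helper names `fold_hashed` and `index_unhashed` are likewise invented for illustration.

```python
from typing import Dict, List, Set, Tuple


def fold_hashed(feature_ids: Set[int], n_bits: int) -> Set[int]:
    """Fold raw feature identifiers into a fixed number of bit positions.

    Distinct identifiers can land on the same position (a collision).
    """
    return {fid % n_bits for fid in feature_ids}


def index_unhashed(
    molecules: List[Set[int]],
) -> Tuple[Dict[int, int], List[Set[int]]]:
    """Give every distinct identifier its own column, so nothing collides.

    Returns the identifier-to-column map and, per molecule, the set of
    occupied columns (a sparse row).
    """
    columns: Dict[int, int] = {}
    rows: List[Set[int]] = []
    for feats in molecules:
        row = set()
        for fid in sorted(feats):
            # setdefault assigns the next free column on first sight
            row.add(columns.setdefault(fid, len(columns)))
        rows.append(row)
    return columns, rows


# Hypothetical identifiers for two molecules (illustrative values only)
molecules = [{3, 1027, 98765}, {3, 2051, 424242}]
n_bits = 1024  # assumed hash size

hashed = [fold_hashed(m, n_bits) for m in molecules]
columns, unhashed = index_unhashed(molecules)

for i, m in enumerate(molecules):
    print(f"molecule {i}: {len(m)} features, "
          f"{len(hashed[i])} hashed bits, {len(unhashed[i])} unhashed columns")
print(f"unhashed matrix width: {len(columns)} columns")
```

In this toy case each molecule has three features but only two hashed bits, because two identifiers fold onto the same position. The unhashed representation keeps all three, at the cost of a matrix that grows with the number of distinct features, which is why it is naturally stored in a sparse format.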