Evaluating parameters for ligand-based modeling with random forest on sparse data sets

Abstract

Ligand-based predictive modeling is widely used to generate predictive models that aid decision making in, for example, drug discovery projects. With growing data sets and requirements on low modeling time comes the necessity to analyze data sets efficiently to support rapid and robust modeling. In this study we analyzed four data sets and studied the efficiency of machine learning methods on sparse data structures, utilizing Morgan fingerprints of different radii and hash sizes, and compared them with the molecular signatures descriptor of different heights. We specifically evaluated the effect these parameters had on modeling time, predictive performance, and memory requirements using two implementations of random forest: Scikit-learn and FEST. We also compared with a support vector machine implementation. Our results showed that unhashed fingerprints yield significantly better accuracy than hashed fingerprints ($p \leq 0.05$), with no pronounced deterioration in modeling time and memory usage. Furthermore, the fast execution and low memory usage of the FEST algorithm suggest that it is a good alternative for large, high dimensional sparse data. Support vector machines and random forest performed equally well, but results indicate that the support vector machine was better at using the extra information from larger values of the Morgan fingerprint's radius.

Electronic supplementary material: The online version of this article (10.1186/s13321-018-0304-9) contains supplementary material, which is available to authorized users.
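The difference between hashed and unhashed fingerprints mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a fingerprint starts as a list of integer substructure identifiers and that hashing folds them into a fixed number of positions with a modulo operation. The identifiers, the function names, and the modulo folding are all assumptions made for illustration; they are not taken from the paper or from any particular cheminformatics toolkit.

```python
from collections import Counter


def unhashed_fingerprint(feature_ids):
    """Sparse representation: one entry per distinct identifier, with its count."""
    return dict(Counter(feature_ids))


def hashed_fingerprint(feature_ids, n_bits):
    """Fold identifiers into n_bits positions (assumed modulo folding).

    Distinct identifiers with the same remainder end up in the same position.
    """
    return dict(Counter(fid % n_bits for fid in feature_ids))


def count_collisions(feature_ids, n_bits):
    """Number of distinct identifiers lost because they share a folded position."""
    distinct = set(feature_ids)
    folded_positions = {fid % n_bits for fid in distinct}
    return len(distinct) - len(folded_positions)


# Hypothetical substructure identifiers for a single molecule (made-up values).
example_ids = [3, 19, 40, 1027, 2051, 2051]

sparse = unhashed_fingerprint(example_ids)
folded = hashed_fingerprint(example_ids, n_bits=16)

print("unhashed:", sparse)
print("hashed (16 positions):", folded)
print("collisions at 16 positions:", count_collisions(example_ids, 16))
print("collisions at 2048 positions:", count_collisions(example_ids, 2048))
```

In this toy case a smaller hash size merges more distinct substructures into the same position, which is one plausible reason a hashed representation could carry less information than an unhashed one. The unhashed form stays compact because only the features that are present are stored.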


Bibliographic record

Journal: Journal of Cheminformatics
Authors: Alexander Kensert; Jonathan Alvarsson; Ulf Norinder; Ola Spjuth
Year (volume), issue: 2018 (10), -1
Year: 2018
Page: 49
Total pages: 10
Original format: PDF
Chinese Library Classification: Biochemical genetics; Biochemical pharmacology
Keywords: Random forest; Support vector machines; Sparse representation; Fingerprint; Machine learning


Similar literature

1. Random Forest-Based Prospectivity Modelling of Greenfield Terrains Using Sparse Deposit Data: An Example from the Tanami Region, Western Australia [J]. Siddharth Hariharan, Siddhesh Tirodkar, Alok Porwal. Natural resources research. 2017, No. 4
2. Random forests and the data sparseness problem in language modeling [J]. Peng Xu, Frederick Jelinek. Computer speech and language. 2007, No. 1
3. Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets [J]. Robinson Richard L. Marchese, Palczewska Anna, Palczewski Jan. Journal of chemical information and modeling. 2017, No. 8
4. Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets [C]. Nasim Adnan, Zahidul Islam. European symposium on artificial neural networks, computational intelligence and machine learning. 2015
5. Random forests and the data sparseness problem in language modeling [D]. Xu, Peng. 2005
6. Determination of robust ocular pharmacokinetic parameters in serum and vitreous humor of albino rabbits following systemic administration of ciprofloxacin from sparse data sets by using IT2S, a population pharmacokinetic modeling program [O]. G L Drusano, W Liu, R Perkins. 1995
7. Exploiting random projections and sparsity with random forests and gradient boosting methods - Application to multi-label and multi-output learning, random forest model compression and leveraging input sparsity [O]. Joly, Arnaud. 2017
