
BioCompoundML: A General Biofuel Property Screening Tool for Biological Molecules Using Random Forest Classifiers



Abstract

Screening a large number of biologically derived molecules for potential fuel compounds without recourse to experimental testing is important in identifying understudied yet valuable molecules. Experimental testing, although a valuable standard for measuring fuel properties, has several major limitations, including the requirement of testably high quantities, considerable expense, and a large amount of time. This paper discusses the development of a general-purpose fuel property tool, using machine learning, whose outcome is to screen molecules for desirable fuel properties. BioCompoundML adopts a general methodology, requiring as input only a list of training compounds (with identifiers and measured values) and a list of testing compounds (with identifiers). For the training data, BioCompoundML collects open data from the National Center for Biotechnology Information, incorporates user-provided features, imputes missing values, performs feature reduction, builds a classifier, and clusters compounds. BioCompoundML then collects data for the testing compounds, predicts class membership, and determines whether compounds are found in the range of variability of the training data set. This tool is demonstrated using three different fuel properties: research octane number (RON), threshold soot index (TSI), and melting point (MP). We provide measures of its success with these properties using randomized train/test measurements: average accuracy is 88% in RON, 85% in TSI, and 94% in MP; average precision is 88% in RON, 88% in TSI, and 95% in MP; and average recall is 88% in RON, 82% in TSI, and 97% in MP. The receiver operator characteristics (area under the curve) were estimated at 0.88 in RON, 0.86 in TSI, and 0.87 in MP. We also measured the success of BioCompoundML by sending 16 compounds for direct RON determination. 
Finally, we provide a screen of 1977 hydrocarbons/oxygenates within the 8696 compounds in MetaCyc, identifying compounds with high predictive strength for high or low RON.
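
The accuracy, precision, recall, and area-under-the-curve figures quoted in the abstract are standard binary-classification measures. The following is a minimal sketch of how such measures can be computed from predicted classes and scores. It is not taken from BioCompoundML itself: the function names, the 0.5 decision threshold, and the labels and scores are illustrative assumptions, not data from the paper.

```python
# Illustrative sketch only: not BioCompoundML code, and the data are made up.

def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def roc_auc(y_true, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs in which the positive example receives the higher score.
    Tied scores count as half a correctly ranked pair."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 might stand for "high RON", 0 for "low RON".
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]  # assumed classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]     # assumed 0.5 threshold

accuracy, precision, recall = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy, precision, and recall depend on the chosen decision threshold, whereas the area under the curve summarizes ranking quality across all thresholds, which is why the two kinds of figure can differ for the same property.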

Bibliographic details

  • Source
    Energy & Fuels | 2016, Issue 10 | pp. 8410-8418 | 9 pages
  • Author affiliations

    Sandia Natl Labs, Livermore, CA 94551 USA|Joint BioEnergy Inst, Emeryville, CA 94608 USA;

    Sandia Natl Labs, Livermore, CA 94551 USA;

    Natl Renewable Energy Lab, Golden, CO 80401 USA;

    Sandia Natl Labs, Livermore, CA 94551 USA|Joint BioEnergy Inst, Emeryville, CA 94608 USA;

    Sandia Natl Labs, Livermore, CA 94551 USA|Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA;

    Sandia Natl Labs, Livermore, CA 94551 USA|Joint BioEnergy Inst, Emeryville, CA 94608 USA;

    Sandia Natl Labs, Livermore, CA 94551 USA|Joint BioEnergy Inst, Emeryville, CA 94608 USA;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI); MEDLINE
  • Original format: PDF
  • Language of text: English

  • Date added to database: 2022-08-18 00:39:58
