Journal of Chemical Theory and Computation (JCTC)

Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error


Abstract

We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of 13 electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves, which report out-of-sample errors as a function of training set size with up to ~118k distinct molecules. Molecular structures and properties at the hybrid density functional theory (DFT) level of theory come from the QM9 database [Ramakrishnan et al. Sci. Data 2014, 1, 140022] and include enthalpies and free energies of atomization, HOMO/LUMO energies and gap, dipole moment, polarizability, zero-point vibrational energy, heat capacity, and the highest fundamental vibrational frequency. Various molecular representations have been studied (Coulomb matrix, bag of bonds, BAML, ECFP4, and molecular graphs (MG)), as well as newly developed distribution-based variants including histograms of distances (HD), angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR), and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). Out-of-sample errors are strongly dependent on the choice of representation, regressor, and molecular property. Electronic properties are typically best accounted for by MG and GC, while energetic properties are better described by HDAD and KRR. The specific combinations with the lowest out-of-sample errors in the ~118k training set size limit are (free) energies and enthalpies of atomization (HDAD/KRR), HOMO/LUMO eigenvalue and gap (MG/GC), dipole moment (MG/GC), static polarizability (MG/GG), zero-point vibrational energy (HDAD/KRR), heat capacity at room temperature (HDAD/KRR), and highest fundamental vibrational frequency (BAML/RF).
We present numerical evidence that ML model predictions deviate from DFT (B3LYP) less than DFT (B3LYP) deviates from experiment for all properties. Furthermore, out-of-sample prediction errors with respect to hybrid DFT reference are on par with, or close to, chemical accuracy. The results suggest that ML models could be more accurate than hybrid DFT if explicitly electron correlated quantum (or experimental) data were available.
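
The learning-curve protocol mentioned in the abstract (out-of-sample error as a function of training set size) can be illustrated with a minimal sketch. This is not the authors' setup: a toy one-dimensional target function stands in for the QM9 properties, a scalar input stands in for a molecular representation, and the Gaussian kernel width `sigma`, the regularizer `lam`, and all function names are assumptions chosen for illustration. Only the Python standard library is used.

```python
import math
import random


def gaussian_kernel(a, b, sigma):
    """Gaussian (RBF) kernel between two scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def solve_linear(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def krr_fit(xs, ys, sigma, lam):
    """Kernel ridge regression: solve (K + lam * I) alpha = y."""
    K = [[gaussian_kernel(a, b, sigma) + (lam if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    return solve_linear(K, ys)


def krr_predict(x, xs, alphas, sigma):
    """Prediction is a kernel-weighted sum over the training points."""
    return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alphas, xs))


def toy_target(x):
    """Hypothetical stand-in for a molecular property (assumption)."""
    return math.sin(3.0 * x) + 0.5 * x


def learning_curve(train_sizes, sigma=0.5, lam=1e-6, seed=0):
    """Return (training set size, out-of-sample MAE) pairs."""
    rng = random.Random(seed)
    pool = [rng.uniform(-3.0, 3.0) for _ in range(max(train_sizes))]
    test = [rng.uniform(-3.0, 3.0) for _ in range(200)]  # held-out points
    curve = []
    for n in train_sizes:
        xs = pool[:n]
        ys = [toy_target(x) for x in xs]
        alphas = krr_fit(xs, ys, sigma, lam)
        mae = sum(abs(krr_predict(x, xs, alphas, sigma) - toy_target(x))
                  for x in test) / len(test)
        curve.append((n, mae))
    return curve


if __name__ == "__main__":
    for n, mae in learning_curve([5, 20, 80]):
        print(f"N = {n:3d}  out-of-sample MAE = {mae:.4f}")
```

In a study like the one described, each regressor/representation/property combination would get its own curve of this kind, and the curves would be compared at the largest training set size. Here the held-out test points are never used for fitting, which is what makes the reported error out-of-sample.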
