Source: Journal of Proteome Research

Evaluation of Multivariate Classification Models for Analyzing NMR Metabolomics Data


Abstract

Analytical techniques such as NMR and mass spectrometry can generate large metabolomics data sets containing thousands of spectral features derived from numerous biological observations. Multivariate data analysis is routinely used to uncover the underlying biological information contained within these large metabolomics data sets. This is typically accomplished by classifying the observations into groups (e.g., control versus treated) and by identifying associated discriminating features. A variety of classification models are available, including well-established techniques (e.g., principal component analysis [PCA], orthogonal projection to latent structure [OPLS], or partial least-squares projection to latent structures [PLS]) and newly emerging machine learning algorithms (e.g., support vector machines or random forests). However, it is unclear which classification model, if any, is an optimal choice for the analysis of metabolomics data. Herein, we present a comprehensive evaluation of five common classification models that are routinely employed in the metabolomics field and are currently available in our MVAPACK metabolomics software package. Simulated and experimental NMR data sets with various levels of group separation were used to evaluate each model. Model performance was assessed by classification accuracy rate, by the area under a receiver operating characteristic (AUROC) curve, and by the identification of true discriminating features. Our findings suggest that the five classification models perform equally well with robust data sets. Only when the models are stressed with subtle data set differences does OPLS emerge as the best-performing model. With limited group separation, OPLS maintained a high prediction accuracy rate and a large area under the ROC curve while yielding loadings closest to the true loadings.
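The three assessment criteria named in the abstract (classification accuracy, AUROC, and recovery of the true discriminating features) can be illustrated with a minimal sketch on simulated two-group data. Everything in it is an assumption made for illustration: the Gaussian data generator, the function names (`simulate`, `evaluate`, and so on), the parameter values, and in particular the classifier, which is a simple difference-of-group-means projection used as a stand-in. It is not PCA, PLS, OPLS, a support vector machine, or a random forest, and it does not use MVAPACK or reproduce the paper's simulation design.

```python
import random


def auroc(scores, labels):
    """AUROC via the Mann-Whitney identity: the fraction of
    (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(predicted, labels):
    """Fraction of observations assigned to the correct group."""
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


def simulate(n_per_group, n_features, n_true, separation, rng):
    """Hypothetical generator: unit-variance Gaussian features; in group 1
    the first n_true features are shifted by `separation`."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_group):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_true):
                    row[j] += separation
            X.append(row)
            y.append(label)
    return X, y


def group_mean(X, y, label):
    """Per-feature mean of the observations in one group."""
    rows = [x for x, lab in zip(X, y) if lab == label]
    return [sum(col) / len(rows) for col in zip(*rows)]


def evaluate(separation, seed=0, n_per_group=40, n_features=50, n_true=5):
    """Fit the stand-in classifier on one simulated set, score a second one.
    Returns (accuracy, AUROC, fraction of true features recovered)."""
    rng = random.Random(seed)
    X_train, y_train = simulate(n_per_group, n_features, n_true, separation, rng)
    X_test, y_test = simulate(n_per_group, n_features, n_true, separation, rng)

    m0 = group_mean(X_train, y_train, 0)
    m1 = group_mean(X_train, y_train, 1)
    # Difference of group means: a crude analogue of a loading vector.
    direction = [b - a for a, b in zip(m0, m1)]
    midpoint = [(a + b) / 2 for a, b in zip(m0, m1)]

    # Project each test observation onto the direction, centered at the midpoint.
    scores = [sum(d * (v - c) for d, v, c in zip(direction, x, midpoint))
              for x in X_test]
    predicted = [1 if s > 0 else 0 for s in scores]

    # Features with the largest absolute weights versus the truly shifted ones.
    top = sorted(range(n_features), key=lambda j: abs(direction[j]),
                 reverse=True)[:n_true]
    recovered = len(set(top) & set(range(n_true))) / n_true
    return accuracy(predicted, y_test), auroc(scores, y_test), recovered


for sep in (0.25, 1.0, 3.0):
    acc, auc, rec = evaluate(sep)
    print(f"separation={sep}: accuracy={acc:.2f}, AUROC={auc:.2f}, "
          f"true features recovered={rec:.2f}")
```

Running the loop at several separation values gives a rough feel for how all three criteria can degrade as the groups move closer together, which is the regime in which the abstract reports that the models begin to differ.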
