Journal: Analytica chimica acta

A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data


Abstract

Many analytical approaches, such as mass spectrometry, generate large amounts of data (input variables) per sample analysed, and not all of these variables are important or related to the target output of interest. Selecting a smaller number of variables prior to sample classification is a common task in many research studies, which seek the smallest possible set of variables that still achieves a high level of prediction accuracy; in other words, the most parsimonious solution is needed when the number of input variables is huge but the number of samples/objects is smaller. Here, we compare several variable selection approaches in order to ascertain which of them are best suited to this goal. All variable selection approaches were applied to a common set of metabolomics data generated by Curie-point pyrolysis mass spectrometry (Py-MS), where the goal of the study was to classify Gram-positive bacteria of the genus Bacillus. The approaches include stepwise forward variable selection, used for linear discriminant analysis (LDA); the variable importance for projection (VIP) coefficient, employed in partial least squares-discriminant analysis (PLS-DA); support vector machines-recursive feature elimination (SVM-RFE); and the mean decrease in accuracy and mean decrease in Gini, provided by random forests (RF). Finally, a double cross-validation procedure was applied to minimize the consequences of overfitting. The results revealed that RF with its variable selection techniques, and SVM combined with SVM-RFE as a variable selection method, gave the best results in comparison with the other approaches.
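The double cross-validation mentioned in the abstract can be illustrated with a minimal sketch. This is not the study's implementation. It assumes a simple stand-in ranking (absolute difference of class means) and a nearest-centroid classifier in place of the LDA, PLS-DA, SVM-RFE and RF methods compared in the paper, and it runs on synthetic data. All function names and parameters (`make_data`, `k_folds`, `rank_features`, `centroid_accuracy`, `double_cross_validation`, `sizes`) are hypothetical. The point it shows is that the number of variables is chosen in an inner loop using training data only, so the outer test fold never influences variable selection.

```python
import random
from statistics import mean


def make_data(n=60, p=30, informative=3, seed=1):
    """Synthetic two-class data (an assumption, not the Py-MS data set)."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(2.0 * y[i] if j < informative else 0.0, 1.0)
          for j in range(p)] for i in range(n)]
    return X, y


def k_folds(indices, y, k, rng):
    """Shuffle, group by class, then deal out so folds stay class-balanced."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    shuffled.sort(key=lambda i: y[i])  # stable sort keeps the shuffle within a class
    return [shuffled[i::k] for i in range(k)]


def rank_features(X, y, rows):
    """Stand-in ranking: absolute difference of the two class means."""
    scores = []
    for j in range(len(X[0])):
        a = mean(X[i][j] for i in rows if y[i] == 0)
        b = mean(X[i][j] for i in rows if y[i] == 1)
        scores.append(abs(a - b))
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def centroid_accuracy(X, y, train, test, features):
    """Nearest-centroid classifier restricted to the selected variables."""
    centroids = {
        label: [mean(X[i][j] for i in train if y[i] == label) for j in features]
        for label in (0, 1)
    }
    correct = 0
    for i in test:
        dist = {label: sum((X[i][j] - c) ** 2 for j, c in zip(features, cen))
                for label, cen in centroids.items()}
        if min(dist, key=dist.get) == y[i]:
            correct += 1
    return correct / len(test)


def double_cross_validation(X, y, sizes=(1, 2, 5, 10), k_outer=5, k_inner=4, seed=0):
    """Return one (chosen number of variables, test accuracy) pair per outer fold."""
    rng = random.Random(seed)
    outer = k_folds(list(range(len(X))), y, k_outer, rng)
    results = []
    for o, test in enumerate(outer):
        train = [i for f, fold in enumerate(outer) if f != o for i in fold]
        # Inner loop: pick the subset size using the training samples only.
        inner = k_folds(train, y, k_inner, rng)
        inner_scores = {s: [] for s in sizes}
        for v, val in enumerate(inner):
            fit = [i for f, fold in enumerate(inner) if f != v for i in fold]
            ranking = rank_features(X, y, fit)
            for s in sizes:
                inner_scores[s].append(centroid_accuracy(X, y, fit, val, ranking[:s]))
        # Ties are broken in favour of the smaller (more parsimonious) subset.
        best = max(sizes, key=lambda s: (mean(inner_scores[s]), -s))
        # Outer loop: re-rank on all training data, score once on the held-out fold.
        ranking = rank_features(X, y, train)
        results.append((best, centroid_accuracy(X, y, train, test, ranking[:best])))
    return results


if __name__ == "__main__":
    X, y = make_data()
    for size, acc in double_cross_validation(X, y):
        print(f"variables kept: {size:2d}  outer-fold accuracy: {acc:.2f}")
```

Replacing `rank_features` with, for example, RF importance scores or an SVM-RFE ranking would leave the nesting unchanged. The structure of the two loops, rather than the particular ranking, is what limits optimistic bias from overfitting.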
