Published in: Frontiers in Molecular Biosciences

Feature Selection Methods for Early Predictive Biomarker Discovery Using Untargeted Metabolomic Data

Abstract

Untargeted metabolomics is a powerful phenotyping tool for better understanding the biological mechanisms involved in the development of human pathologies and for identifying early predictive biomarkers. This approach, which combines analytical platforms such as mass spectrometry with chemometrics and bioinformatics, generates massive and complex data that require appropriate analysis to extract biologically meaningful information. Despite the various tools available, handling such large and noisy datasets with a limited number of individuals without risking overfitting remains a challenge. Moreover, when the objective is to identify early predictive markers of a clinical outcome a few years before its occurrence, appropriate algorithms and workflows become essential for detecting subtle effects in this large amount of data. In this context, this work studies a workflow describing the general feature selection process, using knowledge discovery and data mining methodologies to propose advanced solutions for predictive biomarker discovery. The strategy focused on evaluating a combination of numeric-symbolic approaches for feature selection, with the objective of obtaining the combination of metabolites that produces the most effective and accurate predictive model. Relying first on numerical approaches, in particular machine learning methods (SVM-RFE, RF, RF-RFE) and univariate statistical analysis (ANOVA), a comparative study was performed on the original metabolomic dataset and on reduced subsets. Leave-one-out cross-validation (LOOCV) was applied as the resampling method to limit the risk of overfitting. The best k features obtained with the different importance scores from the combination of these approaches were compared, and the stability of the variables was determined using Formal Concept Analysis.
The results showed the value of combining RF-Gini with ANOVA for feature selection, as these two complementary methods selected the 48 best candidates for prediction. Linear logistic regression on this reduced dataset gave the best performance in terms of prediction accuracy and number of false positives, with a model including the 5 top variables. These results therefore highlight the value of feature selection methods and the importance of working on reduced datasets for identifying predictive biomarkers from untargeted metabolomics data.
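
As a rough illustration of the univariate part of such a workflow, the sketch below ranks features by their one-way ANOVA F statistic and estimates accuracy by LOOCV. It is not the authors' pipeline. It uses only the Python standard library, a nearest-centroid classifier stands in for the logistic regression named in the abstract, and the RF-Gini, SVM-RFE and Formal Concept Analysis steps are omitted. The data are synthetic, and all function names (`anova_f`, `top_k_by_anova`, `loocv_accuracy`, `make_toy_data`) are hypothetical. Repeating the feature ranking inside every fold is an assumption about good practice rather than something the abstract states: ranking on the full dataset first would let the held-out sample influence the selection and inflate the accuracy estimate.

```python
import random
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic for one feature across the classes in `labels`."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if ss_within == 0:  # degenerate case: no spread inside any class
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def top_k_by_anova(X, y, k):
    """Indices of the k features with the largest F statistic."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Assign x to the class whose mean profile on `features` is closest."""
    best_label, best_dist = None, float("inf")
    for lab in sorted(set(y_train)):
        rows = [r for r, l in zip(X_train, y_train) if l == lab]
        centroid = [mean([r[j] for r in rows]) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = lab, dist
    return best_label


def loocv_accuracy(X, y, k):
    """Leave-one-out accuracy with feature selection redone inside every fold."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        # Rank features on the training fold only, never on the held-out sample.
        features = top_k_by_anova(X_train, y_train, k)
        hits += nearest_centroid_predict(X_train, y_train, X[i], features) == y[i]
    return hits / len(X)


def make_toy_data(n_per_class=10, n_features=30, shift=5.0, seed=0):
    """Synthetic stand-in for a metabolite table: only features 0 and 1 differ by class."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[0] += shift * label
            row[1] += shift * label
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("top features:", top_k_by_anova(X, y, 2))
    print("LOOCV accuracy:", loocv_accuracy(X, y, 2))
```

On real untargeted data, with many more features than samples and much weaker effects than in this toy table, the features chosen can differ from fold to fold. That instability is what a comparison of the selected variables across methods is meant to assess.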
