Journal: Metabolomics

Missing values in mass spectrometry based metabolomics: an undervalued step in the data processing pipeline


Abstract

Missing values occur widely in mass spectrometry metabolomic datasets and can originate from a number of sources, both technical and biological. Currently, little is known about these data: how they are distributed across datasets, whether they need to be considered in the data processing pipeline and, most importantly, how best to assign them values prior to univariate or multivariate data analysis. Here, we address all of these issues using direct infusion Fourier transform ion cyclotron resonance mass spectrometry data. We show that missing data are widespread, accounting for ca. 20% of data and affecting up to 80% of all variables, and that they do not occur randomly but rather as a function of signal intensity and mass-to-charge ratio. We demonstrate that missing data estimation algorithms have a major effect on the outcome of data analysis when comparing the differences between biological sample groups, including by t-test, ANOVA and principal component analysis. Furthermore, results varied significantly across the eight algorithms that we assessed for their ability to impute entries whose values were known but had been labelled as missing. Based on all of our findings, we identified the k-nearest neighbour (KNN) imputation method as the optimal missing value estimation approach for our direct infusion mass spectrometry datasets. However, we believe the wider significance of this study is that it highlights the importance of missing metabolite levels in the data processing pipeline and offers an approach to identify optimal ways of treating missing data in metabolomics experiments.
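
The evaluation idea described in the abstract (hide entries whose values are known, impute them, and compare the estimates with the truth) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `knn_impute` and `masked_nrmse`, the layout (rows as peaks, columns as samples), the Euclidean distance over shared columns, the default `k=5` and the uniform random masking are all assumptions made for illustration. Uniform masking in particular is a simplification, since the abstract reports that real missing values depend on signal intensity and mass-to-charge ratio.

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill NaNs in X (assumed layout: rows = peaks, columns = samples)
    with the mean of the k most similar rows that have a value there."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        # Distance from row i to every row, using only columns observed in both.
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
        dist[n_shared == 0] = np.inf   # no overlap: cannot be compared
        dist[i] = np.inf               # a row is never its own neighbour
        order = np.argsort(dist)
        for j in np.where(missing[i])[0]:
            # Nearest comparable rows that actually have a value in column j.
            donors = [r for r in order
                      if not missing[r, j] and np.isfinite(dist[r])][:k]
            if donors:                 # entry stays NaN if no donor exists
                filled[i, j] = X[donors, j].mean()
    return filled

def masked_nrmse(X_complete, impute, frac=0.1, seed=0):
    """Hide a random fraction of known entries, impute them, and return the
    root-mean-square error normalised by the spread of the hidden values."""
    rng = np.random.default_rng(seed)
    X_complete = np.asarray(X_complete, dtype=float)
    mask = rng.random(X_complete.shape) < frac
    X_masked = X_complete.copy()
    X_masked[mask] = np.nan            # known values, labelled as missing
    X_hat = impute(X_masked)
    err = X_hat[mask] - X_complete[mask]
    return np.sqrt(np.nanmean(err ** 2)) / np.std(X_complete[mask])
```

Calling `masked_nrmse(data, lambda M: knn_impute(M, k=5))` on a complete intensity matrix gives one score for one method. Passing a different imputation function through the same masking gives a comparable score, which is the kind of side-by-side comparison across algorithms that the abstract describes.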
