
An evolving computational platform for biological mass spectrometry: workflows statistics and data mining with MASSyPup64


Abstract

In biological mass spectrometry, crude instrumental data need to be converted into meaningful theoretical models. Several data processing and data evaluation steps are required to arrive at the final results. These operations are often difficult to reproduce because they depend on highly specific computing platforms. This effect, known as ‘workflow decay’, can be diminished by using a standardized informatics infrastructure. Thus, we compiled an integrated platform that contains ready-to-use tools and workflows for mass spectrometry data analysis. Apart from general unit operations, such as peak picking and identification of proteins and metabolites, we put a strong emphasis on the statistical validation of results and Data Mining. MASSyPup64 includes, e.g., the OpenMS/TOPPAS framework, the Trans-Proteomic-Pipeline programs, the ProteoWizard tools, X!Tandem, Comet and SpiderMass. The statistical computing language R is installed with packages for MS data analyses, such as XCMS/metaXCMS and MetabR. The R package Rattle provides user-friendly access to multiple Data Mining methods. Further, we added the non-conventional spreadsheet program teapot for editing large data sets and a command line tool for transposing large matrices. Individual programs, console commands and modules can be integrated using the Workflow Management System (WMS) taverna. We explain the useful combination of the tools by practical examples: (1) a workflow for protein identification and validation, with subsequent Association Analysis of peptides, (2) cluster analysis and Data Mining in targeted Metabolomics, and (3) raw data processing, Data Mining and identification of metabolites in untargeted Metabolomics. Association Analyses reveal relationships between variables across different sample sets. We present their application for finding co-occurring peptides, which can be used for targeted proteomics, the discovery of alternative biomarkers and protein–protein interactions.
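The idea of finding co-occurring peptides through Association Analysis can be illustrated with a minimal sketch. It is not the workflow shipped with MASSyPup64: the function name `pair_rules`, the peptide labels and the sample sets are all hypothetical, and only pairwise rules are scored, using the standard support, confidence and lift measures.

```python
from collections import Counter
from itertools import combinations


def pair_rules(samples, min_support=0.5):
    """Score pairwise rules x -> y as (x, y, support, confidence, lift)."""
    n = len(samples)
    item_counts = Counter()
    pair_counts = Counter()
    for items in samples:
        unique = sorted(set(items))
        item_counts.update(unique)                   # samples containing each item
        pair_counts.update(combinations(unique, 2))  # samples containing each pair
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]      # P(y | x)
            lift = confidence / (item_counts[y] / n)  # P(y | x) / P(y)
            rules.append((x, y, support, confidence, lift))
    # Highest lift first; ties are broken by confidence, then by name.
    return sorted(rules, key=lambda r: (-r[4], -r[3], r[0], r[1]))


# Hypothetical peptide identifications per sample (made-up labels, not data from the paper).
samples = [
    {"PEP_A", "PEP_B", "PEP_C"},
    {"PEP_A", "PEP_B"},
    {"PEP_A", "PEP_B", "PEP_D"},
    {"PEP_C", "PEP_D"},
]
rules = pair_rules(samples, min_support=0.5)
for rule in rules:
    print(rule)
```

In this toy input only the pair PEP_A/PEP_B passes the support threshold. A lift above 1 indicates that the two peptides occur together more often than expected if they were independent.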
Data Mining-derived models displayed higher robustness and accuracy for classifying sample groups in targeted Metabolomics than cluster analyses. Random Forest models provide not only predictive models, which can be deployed for new data sets, but also the variable importance. We demonstrate that the latter is especially useful for tracking down significant signals and affected pathways in untargeted Metabolomics. Thus, Random Forest modeling supports the unbiased search for relevant biological features in Metabolomics. Our results clearly demonstrate the importance of Data Mining methods for disclosing non-obvious information in biological mass spectrometry. The application of a Workflow Management System and the integration of all required programs and data in a consistent platform make the presented data analysis strategies reproducible for non-expert users. The simple remastering process and the Open Source licenses of MASSyPup64 () enable the continuous improvement of the system.
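How variable importance can point to the relevant signals may be easier to see in a toy example. The sketch below is a deliberately simplified stand-in, not the Random Forest implementation used via R/Rattle: it bags one-split decision stumps over random feature subsets and measures permutation importance on out-of-bag samples. All function names, parameters and the synthetic data are assumptions made for illustration.

```python
import random


def fit_stump(rows, labels, features):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = max(sorted(set(left)), key=left.count)
            right_label = max(sorted(set(right)), key=right.count)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no candidate feature varies: predict the majority class
        majority = max(sorted(set(labels)), key=labels.count)
        return (features[0], float("inf"), majority, majority)
    return best[1:]


def predict(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label


def accuracy(stump, rows, labels):
    return sum(predict(stump, r) == y for r, y in zip(rows, labels)) / len(rows)


def variable_importance(rows, labels, n_trees=200, mtry=2, seed=0):
    """Mean drop in out-of-bag accuracy after permuting each feature."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    totals = [0.0] * p
    for _ in range(n_trees):
        boot = rng.choices(range(n), k=n)            # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        stump = fit_stump([rows[i] for i in boot], [labels[i] for i in boot],
                          rng.sample(range(p), mtry))  # random feature subset
        oob_rows = [rows[i] for i in oob]
        oob_labels = [labels[i] for i in oob]
        baseline = accuracy(stump, oob_rows, oob_labels)
        for j in range(p):
            column = [row[j] for row in oob_rows]
            rng.shuffle(column)                      # break the feature/label link
            permuted = [row[:j] + [v] + row[j + 1:]
                        for row, v in zip(oob_rows, column)]
            totals[j] += baseline - accuracy(stump, permuted, oob_labels)
    return [t / n_trees for t in totals]


# Synthetic data: feature 0 separates the two groups, features 1 and 2 are noise.
data_rng = random.Random(1)
rows, labels = [], []
for i in range(40):
    label = i % 2
    signal = data_rng.uniform(0, 1) + 2 * label
    rows.append([signal, data_rng.uniform(0, 3), data_rng.uniform(0, 3)])
    labels.append(label)

importance = variable_importance(rows, labels)
print([round(v, 3) for v in importance])
```

Permuting the informative feature destroys the out-of-bag accuracy of every stump that relies on it, so it receives by far the largest score, while the noise features stay close to zero. In an untargeted data set, the features ranked highest in this way would be the candidates to follow up.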
