Journal: BMC Bioinformatics

A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration


Abstract

Background: In bioinformatics, we pre-process raw data into a format ready for answering medical and biological questions. A key step in processing is labeling the measured features with the identities of the molecules purportedly assayed: "molecular identification" (MI). Biological meaning comes from identifying these molecular measurements correctly with actual molecular species. But MI can be incorrect. Identifier filtering (IDF) selects features with more trusted MI, leaving a smaller, but more correct dataset. Identifier mapping (IDM) is needed when an analyst is combining two high-throughput (HT) measurement platforms on the same samples. IDM produces ID pairs, one ID from each platform, where the mapping declares that the two analytes are associated through a causal path, direct or indirect (example: pairing an ID for an mRNA species with an ID for a protein species that is its putative translation). Many competing solutions for IDF and IDM exist. Analysts need a rigorous method for evaluating and comparing all these choices.

Results: We describe a paradigm for critically evaluating and comparing IDF and IDM methods, guided by data on biological samples. The requirements are: a large set of biological samples, measurements on those samples from at least two high-throughput platforms, a model family connecting features from the platforms, and an association measure. From these ingredients, one fits a mixture model coupled to a decision framework. We demonstrate this evaluation paradigm in three settings: comparing performance of several bioinformatics resources for IDM between transcripts and proteins, comparing several published microarray probeset IDF methods and their combinations, and selecting optimal quality thresholds for tandem mass spectrometry spectral events.

Conclusions: The paradigm outlined here provides a data-grounded approach for evaluating the quality not just of IDM and IDF, but of any pre-processing step or pipeline. The results will help researchers to semantically integrate or filter data optimally, and help bioinformatics database curators to track changes in quality over time and even to troubleshoot causes of MI errors.
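
The abstract names the ingredients (paired measurements on the same samples, an association measure, a mixture model, a decision framework) but gives no formulas. The following is a minimal sketch of how such a pipeline might look; it is not the authors' implementation. Everything specific is an assumption made for illustration: Pearson correlation as the association measure, a Fisher z-transform, a two-component Gaussian mixture with a null component centered at zero, and the utility values `gain_true` and `loss_false`. All function names are hypothetical.

```python
import numpy as np

def pairwise_association(x, y):
    """Pearson correlation across samples for each mapped ID pair.

    x, y: arrays of shape (n_pairs, n_samples); row i of x and row i of y
    are the two features that the mapping declares to be associated.
    """
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    denom = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))
    return (xc * yc).sum(axis=1) / denom

def fisher_z(r):
    """Fisher z-transform; clipping avoids infinities at |r| = 1."""
    return np.arctanh(np.clip(r, -0.999999, 0.999999))

def _normal_pdf(z, mu, s):
    return np.exp(-0.5 * ((z - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def fit_two_component_mixture(z, n_iter=200):
    """EM for an assumed 1-D mixture: null N(0, s0^2) plus non-null N(mu1, s1^2).

    Returns the fitted parameters and, for each pair, the posterior
    probability of belonging to the non-null (correctly mapped) component.
    """
    pi1, mu1 = 0.5, z.mean() + z.std()
    s0 = s1 = max(z.std(), 1e-6)
    for _ in range(n_iter):
        # E-step: posterior probability of the non-null component.
        d0 = (1 - pi1) * _normal_pdf(z, 0.0, s0)
        d1 = pi1 * _normal_pdf(z, mu1, s1)
        post = d1 / (d0 + d1)
        # M-step: weighted updates; the null mean stays fixed at zero.
        w1, w0 = post.sum() + 1e-12, (1 - post).sum() + 1e-12
        pi1 = post.mean()
        mu1 = (post * z).sum() / w1
        s1 = max(np.sqrt((post * (z - mu1) ** 2).sum() / w1), 1e-6)
        s0 = max(np.sqrt(((1 - post) * z ** 2).sum() / w0), 1e-6)
    return {"pi1": pi1, "mu1": mu1, "s0": s0, "s1": s1}, post

def accept_pairs(post, gain_true=1.0, loss_false=1.0):
    """Keep a pair when the expected utility of keeping it is positive."""
    return post * gain_true - (1 - post) * loss_false > 0

def expected_utility(post, gain_true=1.0, loss_false=1.0):
    """Total expected utility of the accepted pairs for one mapping method."""
    u = post * gain_true - (1 - post) * loss_false
    return u[u > 0].sum()
```

Under these assumptions, competing IDM resources or IDF filters could be compared by running each one's ID pairs through the same functions and ranking them by `expected_utility`. The ratio of `loss_false` to `gain_true` sets the posterior threshold for acceptance: with equal values, a pair is kept when its posterior exceeds 0.5.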
