
Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data



Abstract

Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to reflect the molecular mechanisms of the disease subtypes and to lead to more targeted and successful interventions for the identified subtypes. Multiple classification algorithms exist, but none is ideal for all types of data. Importantly, there are no established methods for estimating sample size in unsupervised classification (unlike power analysis in hypothesis testing). Therefore, we developed a simulation approach that allows comparison of misclassification errors and estimation of the required sample size for a given effect size, number, and correlation matrix of the differentially abundant proteins in targeted proteomics studies. All experiments were performed in silico. The simulated data imitated the data expected from a study of the plasma of patients with lower urinary tract dysfunction using the aptamer proteomics assay Somascan (SomaLogic Inc, Boulder, CO), which targets 1129 proteins, including 330 involved in inflammation, 180 in stress response, 80 in aging, etc. Three popular clustering methods (hierarchical, k-means, and k-medoids) were compared. K-means clustering performed much better on the simulated data than the other two methods and enabled classification with a misclassification error below 5% in a simulated cohort of 100 patients, based on the molecular signatures of 40 differentially abundant proteins (effect size 1.5) from among the 1129-protein panel.
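The simulation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: two equally sized subtypes, independent standard-normal protein levels (the study also varies the correlation matrix, which is omitted here), effect size read as a mean shift in standard-deviation units, and a plain-Python Lloyd's k-means with k = 2 standing in for a library routine. All function names (`simulate_cohort`, `kmeans2`, `misclassification_error`, `run`) are hypothetical.

```python
import random


def simulate_cohort(n_patients=100, n_proteins=1129, n_diff=40,
                    effect_size=1.5, seed=0):
    """Simulate two subtypes (assumed equal in size). The first n_diff
    proteins are shifted by effect_size SDs in subtype 1 (assumption)."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_patients)]
    data = []
    for lab in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(n_proteins)]
        if lab == 1:
            for j in range(n_diff):
                row[j] += effect_size
        data.append(row)
    return data, labels


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans2(data, rng, n_iter=50):
    """One run of Lloyd's algorithm with k = 2.
    Returns (assignments, within-cluster sum of squares)."""
    centers = [list(c) for c in rng.sample(data, 2)]
    assign = None
    for _ in range(n_iter):
        new_assign = [0 if sq_dist(r, centers[0]) <= sq_dist(r, centers[1])
                      else 1 for r in data]
        if new_assign == assign:  # converged: assignments unchanged
            break
        assign = new_assign
        for k in (0, 1):
            members = [r for r, a in zip(data, assign) if a == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = [sum(col) / len(members)
                              for col in zip(*members)]
    wss = sum(sq_dist(r, centers[a]) for r, a in zip(data, assign))
    return assign, wss


def misclassification_error(assign, labels):
    """Fraction of patients misassigned, minimised over the two possible
    cluster-to-subtype mappings (cluster numbering is arbitrary)."""
    mismatches = sum(a != l for a, l in zip(assign, labels))
    return min(mismatches, len(labels) - mismatches) / len(labels)


def run(n_restarts=5, seed=0, **kwargs):
    """Simulate one cohort, keep the k-means restart with the lowest
    within-cluster sum of squares, and return its error."""
    data, labels = simulate_cohort(seed=seed, **kwargs)
    rng = random.Random(seed + 1)
    best_assign, _ = min((kmeans2(data, rng) for _ in range(n_restarts)),
                         key=lambda result: result[1])
    return misclassification_error(best_assign, labels)


if __name__ == "__main__":
    # Default settings mirror the cohort size, panel size, number of
    # differential proteins, and effect size quoted in the abstract.
    print(run())
```

Repeating `run` over many seeds and over a grid of `n_patients` values would give an empirical error curve from which a required sample size could be read off, which is the general use the abstract describes. Results from this sketch should not be expected to reproduce the published figures, since the correlation structure and the other clustering methods are not modelled.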
