Statistical learning methods for multi-omics data integration in dimension reduction, supervised and unsupervised machine learning

Abstract

Over the decades, many statistical learning techniques, such as supervised learning, unsupervised learning, and dimension reduction, have played groundbreaking roles in important biomedical research tasks. More recently, multi-omics data integration analysis has become increasingly popular as a way to answer many intractable biomedical questions, to improve statistical power by exploiting large samples and different types of omics data, and to replicate individual experiments for validation. This dissertation covers several analytic methods and frameworks that tackle practical problems in multi-omics data integration analysis.

Supervised prediction rules have been widely applied to high-throughput omics data to predict disease diagnosis, prognosis, or survival risk. The top scoring pair (TSP) algorithm is a supervised discriminant rule that applies a simple, robust rank-based algorithm to identify gene pairs whose relative ranks differ between case and control classes. TSP usually suffers greatly reduced accuracy in inter-study prediction (i.e., when the prediction model is established in a training study and applied to an independent test study). In the first part, we introduce a MetaTSP algorithm that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies.

One important objective of omics data analysis is clustering unlabeled patients in order to identify meaningful disease subtypes. In the second part, we propose a group structured integrative clustering method that incorporates a sparse overlapping group lasso technique and tight clustering via regularization, in order to integrate inter-omics regulation flow and to encourage outlier samples to scatter away from tight clusters. We show with two real examples and simulated data that our proposed methods improve on existing integrative clustering in clustering accuracy and biological interpretation, and are able to generate coherent tight clusters.

Principal component analysis (PCA) is commonly used to project data onto a low-dimensional space for visualization. In the third part, we introduce two meta-analysis frameworks of PCA (Meta-PCA) for analyzing multiple high-dimensional studies in a common principal component space. In theory, Meta-PCA identifies the meta principal component (Meta-PC) space in one of two ways: (1) by decomposing the sum of variances, or (2) by minimizing the sum of squared cosines. Applications to various simulated data sets show that the Meta-PCA methods identify the true principal component space well and remain robust to noise features and outlier samples. We also propose sparse Meta-PCAs that penalize principal components in order to selectively accommodate significant principal component projections. In several simulated and real data applications, we found Meta-PCA efficient at detecting significant transcriptomic features and at recognizing visual patterns in multi-omics data sets.

In the future, successful data integration analysis will play an important role in revealing the molecular and cellular processes underlying multiple data sets, and will facilitate disease subtype discovery and characterization that improve hypothesis generation towards precision medicine and potentially advance public health research.
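The rank-based rule that the abstract attributes to TSP can be illustrated with a minimal sketch of the standard single-study algorithm. This is not the MetaTSP method from the dissertation, whose details the abstract does not give. The function names `top_scoring_pair` and `predict_tsp`, the label coding (1 = case, 0 = control), and the toy expression values are all assumptions made for illustration.

```python
from itertools import combinations

def top_scoring_pair(X, y):
    """Find the gene pair whose ordering differs most between classes.

    X: list of samples, each a list of expression values (one per gene).
    y: list of labels, 1 = case and 0 = control (an assumed coding).
    Returns (i, j, score, case_if_less).
    """
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    if not cases or not controls:
        raise ValueError("both classes must be present")
    best = None
    for i, j in combinations(range(len(X[0])), 2):
        # Fraction of samples in each class where gene i is below gene j.
        p_case = sum(r[i] < r[j] for r in cases) / len(cases)
        p_ctrl = sum(r[i] < r[j] for r in controls) / len(controls)
        score = abs(p_case - p_ctrl)
        if best is None or score > best[2]:
            # case_if_less records which ordering votes for "case".
            best = (i, j, score, p_case > p_ctrl)
    return best

def predict_tsp(X, pair):
    """Classify samples using only the ordering of the selected pair."""
    i, j, _, case_if_less = pair
    return [1 if (r[i] < r[j]) == case_if_less else 0 for r in X]

# Toy data (made up): gene 0 is below gene 1 in cases, above it in controls.
train_X = [[1, 5, 3], [2, 6, 1], [1, 4, 9],
           [5, 1, 3], [6, 2, 1], [7, 3, 9]]
train_y = [1, 1, 1, 0, 0, 0]
pair = top_scoring_pair(train_X, train_y)
predictions = predict_tsp(train_X, pair)
```

Because the rule depends only on the within-sample ordering of two genes, it is unchanged by any monotone rescaling of a sample, which is why rank-based rules are often described as robust. A single pair selected in one study can still fail to transfer to another study, which is the inter-study problem the abstract says MetaTSP addresses.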

Record details

  • Author

    Kim SungHwan

  • Author affiliation
  • Year 2015
  • Total pages
  • Original format PDF
  • Language en
  • Chinese Library Classification
  • Date added 2022-08-31 15:10:47
