BMC Bioinformatics

Machine learning with the TCGA-HNSC dataset: improving usability by addressing inconsistency, sparsity, and high-dimensionality



Abstract

In the era of precision oncology and publicly available datasets, the amount of information available for each patient case has increased dramatically. This wide variety of data, ranging from clinical variables and PET-CT radiomics measures to DNA-variant and RNA expression profiles, presents a multitude of challenges. Large clinical datasets often contain sparsely and/or inconsistently populated fields. The corresponding sequencing profiles can suffer from high dimensionality, where useful inferences are difficult to make without a correspondingly large number of instances.

In this paper we report a novel deployment of machine learning techniques to handle data sparsity and high dimensionality, while evaluating potential biomarkers in the form of unsupervised transformations of RNA data. We apply preprocessing, multiple imputation by chained equations (MICE), and sparse principal component analysis (SPCA) to improve the usability of more than 500 patient cases from the TCGA-HNSC dataset, with the aim of enhancing future oncological decision support for Head and Neck Squamous Cell Carcinoma (HNSCC).

Imputation was shown to improve the prognostic ability of sparse clinical treatment variables. SPCA transformation of RNA expression variables reduced runtime for RNA-based models, though changes to classifier performance were not significant. Gene ontology enrichment analysis of the gene sets associated with individual sparse principal components (SPCs) is also reported. It shows that both high- and low-importance SPCs were associated with cell death pathways, though the high-importance gene sets were associated with a wider variety of cancer-related biological processes.

MICE imputation allowed us to impute missing values for clinically informative features, improving their overall importance for predicting two-year recurrence-free survival by incorporating variance from other clinical variables. Dimensionality reduction of RNA expression profiles via SPCA reduced both computation cost and model training/evaluation time without affecting classifier performance, allowing researchers to obtain experimental results much more quickly. SPCA simultaneously provided a convenient avenue for considering biological context via gene ontology enrichment analysis.
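
The two main steps named in the abstract, chained-equation imputation of sparse clinical fields and sparse PCA of expression profiles, can be illustrated with a minimal sketch. The abstract does not say which software the authors used, so everything here is an assumption: scikit-learn's `IterativeImputer` stands in for MICE (it performs a single chained-equations imputation rather than full multiple imputation), `SparsePCA` stands in for SPCA, the data are synthetic random matrices rather than TCGA-HNSC, and all sizes and hyperparameters are hypothetical.

```python
import numpy as np
# IterativeImputer is experimental; this import must come first to enable it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)

# Synthetic stand-in for clinical variables (NOT real TCGA-HNSC data):
# 60 patients x 5 features, with the last feature correlated with the first.
clinical = rng.normal(size=(60, 5))
clinical[:, 4] = 0.8 * clinical[:, 0] + 0.2 * rng.normal(size=60)

# Knock out roughly 20% of the entries to mimic sparsely populated fields.
missing_mask = rng.random(clinical.shape) < 0.2
clinical_missing = clinical.copy()
clinical_missing[missing_mask] = np.nan

# Chained-equations imputation: each feature with missing values is
# regressed on the other features, cycling for up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
clinical_imputed = imputer.fit_transform(clinical_missing)

# Synthetic stand-in for RNA expression: 60 patients x 200 genes.
rna = rng.normal(size=(60, 200))

# Sparse PCA: components with many exactly-zero loadings, so each
# component is tied to a limited subset of genes.
spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
rna_spcs = spca.fit_transform(rna)

# Indices of genes with non-zero loading on each sparse component.
gene_sets = [np.flatnonzero(component) for component in spca.components_]

print(clinical_imputed.shape, rna_spcs.shape)
print([len(g) for g in gene_sets])
```

In a workflow like the one described, the per-component gene index lists (`gene_sets` here) would be the natural input to a gene ontology enrichment tool, and the reduced matrix `rna_spcs` would replace the full expression matrix as classifier input.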
