Conference: International Conference on Big Data; Services Conference Federation

A Machine Learning Approach to Prostate Cancer Risk Classification Through Use of RNA Sequencing Data

Abstract

Advances in RNA sequencing technology have made the genomic data acquired during sequencing more precise, which in turn makes models fitted to sequencing data more practical. Previous studies of prostate cancer diagnosis have been limited to microarray data and have had limited success. We used The Cancer Genome Atlas (TCGA) prostate cancer sequencing data to test the viability of fitting machine learning models to RNA sequencing data. A major challenge with sequencing data is its high dimensionality. In this research, we addressed two complementary tasks. The first was to identify the genes most associated with potential cancer. We started by using the mutual information metric to identify the most significant genes, then applied the Recursive Feature Elimination (RFE) algorithm to reduce the number of genes needed to identify cancer. The second task was to create a classification model to separate potential high-risk patients from healthy ones. For this task, we addressed the high dimensionality with Principal Component Analysis (PCA). A further challenge is class imbalance: the data set contains cancerous and healthy tissue in a ratio of 10:1. To address this, we used the Synthetic Minority Oversampling Technique (SMOTE) to create synthetic observations and equalize the class distribution. We trained and tested a logistic regression model using 5-fold cross-validation. The results were promising: the model significantly reduced the false negative rate compared with current diagnostic techniques while keeping the false positive rate low. It also showed great improvements over previous machine learning attempts to diagnose prostate cancer. Our model could be applied as part of the patient diagnosis pipeline, helping to improve accuracy.
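
The two tasks described in the abstract can be illustrated with a minimal sketch in Python using scikit-learn. Everything specific here is an assumption for illustration and not taken from the paper: the data are synthetic (not TCGA), the gene counts (50 after mutual information, 10 after RFE), the 20 principal components, and the choice to standardize, apply PCA, and oversample inside each training fold are all hypothetical. The `smote` helper is a small hand-written version of the SMOTE interpolation idea so that the example stays self-contained; in practice a library implementation such as the one in imbalanced-learn would usually be preferred.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an expression matrix (samples x genes); NOT TCGA data.
# Label 1 = cancerous (majority), label 0 = healthy (minority), roughly 10:1.
X, y = make_classification(n_samples=330, n_features=200, n_informative=10,
                           weights=[1 / 11, 10 / 11], flip_y=0, random_state=0)

# Task 1: rank genes by mutual information, then prune the shortlist with RFE.
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:50]  # assumed shortlist size
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X[:, top], y)
selected_genes = top[rfe.support_]  # column indices into X


def smote(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority rows by interpolating between a
    minority sample and one of its k nearest minority neighbours.
    Requires more than k minority samples."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self
    base = rng.integers(0, len(X_min), size=n_new)
    partner = neigh[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[partner] - X_min[base])


# Task 2: PCA + SMOTE + logistic regression under 5-fold cross-validation.
# Fitting the scaler, PCA and SMOTE on the training fold only is an assumption
# made here to keep synthetic rows out of the test folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cm = np.zeros((2, 2), dtype=int)
for train, test in cv.split(X, y):
    scaler = StandardScaler().fit(X[train])
    pca = PCA(n_components=20, random_state=0).fit(scaler.transform(X[train]))
    Z_train = pca.transform(scaler.transform(X[train]))
    Z_test = pca.transform(scaler.transform(X[test]))
    y_train = y[train]
    n_new = int((y_train == 1).sum() - (y_train == 0).sum())
    Z_bal = np.vstack([Z_train, smote(Z_train[y_train == 0], n_new, rng=0)])
    y_bal = np.concatenate([y_train, np.zeros(n_new, dtype=int)])
    clf = LogisticRegression(max_iter=1000).fit(Z_bal, y_bal)
    cm += confusion_matrix(y[test], clf.predict(Z_test), labels=[0, 1])

# Rows are true labels, columns are predictions; cancer (1) is the positive class.
tn, fp, fn, tp = cm.ravel()
false_negative_rate = fn / (fn + tp)
false_positive_rate = fp / (fp + tn)
print(selected_genes, false_negative_rate, false_positive_rate)
```

The pooled confusion matrix gives the two rates the abstract highlights. On real data the numbers depend on preprocessing and hyperparameters that the abstract does not specify, so the values printed here say nothing about the paper's results.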
