Venue: International Conference on Big Data; Services Conference Federation

Cross-Cancer Genome Analysis on Cancer Classification Using Both Unsupervised and Supervised Approaches

Abstract

The current cancer diagnosis pipeline has many problems, among them alarmingly high over-diagnosis rates in breast, prostate, and lung cancer. By quantifying gene expression levels, next-generation sequencing techniques such as RNA-Seq offer researchers and clinicians a more complete view of a cell's transcriptome. With the adoption of this new data source, cross-cancer methods for cancer diagnosis have become more viable. We use mutual information together with a Gaussian mixture model and t-SNE to evaluate the separability of cancer and non-cancer tissue samples from RNA-Seq expression data. The combination of Gaussian mixture model and t-SNE produced clear clusters without supervision, suggesting that tissue samples can be separated algorithmically. We then use a collection of deep neural networks to classify tissue origin and status from the gene expression of tissue samples, with input genes selected by the same mutual information technique. First, we select the top 500 candidate genes without considering overlap in their predictive power. We then apply Recursive Feature Elimination (RFE) to select 200 genes, thereby accounting for covariation. Performance with the top 500 genes is only slightly better than with the 200 RFE-selected genes, and the two approaches perform similarly overall, indicating that only a small subset of genes is needed to identify status and origin. This work indicates that RNA sequencing data is a useful tool for cross-cancer studies. Next steps include incorporating more non-cancer data from other datasets to reduce bias in model training.
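To make the gene-ranking step concrete, the following is a minimal sketch of scoring genes by mutual information with a class label and keeping the top k. It is not the authors' implementation: the abstract does not say which estimator was used, so the equal-frequency binning, the function names `mutual_information` and `top_k_genes`, and the synthetic data are all assumptions made for illustration. Python's standard library is used so the example runs without extra dependencies.

```python
import math
import random
from collections import Counter


def mutual_information(values, labels, n_bins=4):
    """Plug-in estimate of I(X; Y) in nats for one gene versus the class labels.

    Assumption for this sketch: expression values are discretized into
    equal-frequency bins by rank. Tied values are split arbitrarily, which a
    real RNA-Seq pipeline would need to handle more carefully.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // n  # bin index in 0 .. n_bins - 1

    joint = Counter(zip(bins, labels))
    bin_counts = Counter(bins)
    label_counts = Counter(labels)

    mi = 0.0
    for (b, y), count in joint.items():
        # p(b, y) * log( p(b, y) / (p(b) * p(y)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (bin_counts[b] * label_counts[y]))
    return mi


def top_k_genes(expression, labels, k):
    """Return (indices of the k highest-scoring genes, all scores)."""
    n_genes = len(expression[0])
    scores = [
        mutual_information([sample[g] for sample in expression], labels)
        for g in range(n_genes)
    ]
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:k], scores


# Synthetic stand-in for an expression matrix: 200 samples x 20 genes.
# Only genes 0 and 1 depend on the label (hypothetical toy data).
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
expression = []
for y in labels:
    row = [rng.gauss(0.0, 1.0) for _ in range(20)]
    row[0] += 4.0 * y
    row[1] -= 4.0 * y
    expression.append(row)

selected, scores = top_k_genes(expression, labels, k=2)
print("selected genes:", sorted(selected))
```

Each gene is scored independently here, so two highly correlated genes would both rank highly. That is the overlap the abstract says the top-500 selection ignores and the RFE step is meant to address.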