
Optimisation of cancer classification by machine learning generates an enriched list of candidate drug targets and biomarkers


Abstract

The Cancer Genome Atlas has provided expression values of 18015 genes for different cancer types. Studies on the classification of cancers by machine learning algorithms have used different data and methods, which makes it difficult to compare their performance. It is unclear which algorithm performs best and whether maximum levels of accuracy have been obtained. In this study, we aimed to optimise the diagnosis of cancer by comparing the performance of five algorithms using the same data, and by identifying the smallest possible number of differentiator genes. The accuracies of the five algorithms in classifying cancer type and primary site were determined using a gene expression dataset of 5629 samples and a dataset of 9144 samples, respectively. When trained with training sets ranging from 16 718 to 40 genes, Random Forest (RF), Gradient Boosting Machine (GBM), and Neural Network (NN) consistently achieved 100% or near 100% accuracy in the classification of both cancer type and primary site. Reduction of training sets to the 40 highest-ranked genes resulted in 78-fold and 45-fold faster processing times for RF and GBM, respectively. Members of the olfactory receptor family, keratin-associated proteins, and the defensin beta family were among the highest-ranked genes. The ensemble algorithms (RF and GBM) and NN were the most accurate at distinguishing between cancer types and primary sites, whereas KNN was the fastest. Training sets can be reduced to the 40 highest-ranked differentiator genes without any significant loss of accuracy, and these genes include potential drug targets and biomarkers.
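
The workflow described in the abstract (rank genes, keep only the highest-ranked ones, retrain, then compare accuracy and processing time) can be illustrated with a minimal sketch. The abstract does not say how the genes were ranked or which software was used, so everything here is an assumption made for illustration: the data are synthetic, the ranking score is a simple between-class to within-class variance ratio, the classifier is a plain k-nearest-neighbours vote, and all function names (`make_data`, `rank_genes`, `run_demo`, and so on) are hypothetical.

```python
import random
import statistics
import time
from collections import Counter


def make_data(n_samples, n_genes, n_informative, rng):
    """Synthetic 3-class expression matrix; only the first n_informative genes carry signal."""
    levels = [-2.0, 0.0, 2.0]
    # For each informative gene, give the three classes these mean levels in random order.
    means = [rng.sample(levels, 3) for _ in range(n_informative)]
    X, y = [], []
    for i in range(n_samples):
        label = i % 3
        row = [rng.gauss(means[g][label] if g < n_informative else 0.0, 1.0)
               for g in range(n_genes)]
        X.append(row)
        y.append(label)
    return X, y


def rank_genes(X, y):
    """Order gene indices by (variance of class means) / (mean within-class variance)."""
    classes = sorted(set(y))
    scores = []
    for g in range(len(X[0])):
        groups = [[row[g] for row, lab in zip(X, y) if lab == c] for c in classes]
        between = statistics.pvariance([statistics.fmean(v) for v in groups])
        within = statistics.fmean([statistics.pvariance(v) for v in groups])
        scores.append(between / (within + 1e-12))
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)


def select(X, cols):
    """Keep only the listed gene columns."""
    return [[row[g] for g in cols] for row in X]


def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k training samples closest to x (squared Euclidean distance)."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(row, x)), lab)
                   for row, lab in zip(X_train, y_train))
    return Counter(lab for _, lab in dists[:k]).most_common(1)[0][0]


def accuracy(X_train, y_train, X_test, y_test):
    hits = sum(knn_predict(X_train, y_train, x) == lab for x, lab in zip(X_test, y_test))
    return hits / len(y_test)


def run_demo(seed=0, n_genes=200, n_informative=10, top_k=10):
    rng = random.Random(seed)
    X, y = make_data(150, n_genes, n_informative, rng)
    X_train, y_train, X_test, y_test = X[:105], y[:105], X[105:], y[105:]
    # Rank on the training split only, so the test split stays unseen.
    top = rank_genes(X_train, y_train)[:top_k]
    results = {"top_genes": sorted(top)}
    for name, cols in (("all", list(range(n_genes))), ("top", top)):
        start = time.perf_counter()
        acc = accuracy(select(X_train, cols), y_train, select(X_test, cols), y_test)
        results[name] = (acc, time.perf_counter() - start)  # (accuracy, seconds)
    return results


if __name__ == "__main__":
    for key, value in run_demo().items():
        print(key, value)
```

Running the script prints the selected gene indices, followed by an (accuracy, seconds) pair for the full gene set and for the reduced set. The absolute numbers depend on the synthetic data and say nothing about the study's results; the sketch only shows the shape of the comparison.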

Bibliographic record

  • Source
    Molecular BioSystems | 2020, Issue 2 | pp. 113-125 | 13 pages
  • Author affiliations

    Department of Electrical and Computer Engineering University of the West Indies Saint Augustine Trinidad and Tobago;

    Department of Electrical and Computer Engineering University of the West Indies Saint Augustine Trinidad and Tobago;

    Department of Pre-Clinical Sciences University of the West Indies Saint Augustine Trinidad and Tobago;

  • Original format: PDF
  • Language of text: English