...

Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines



Abstract

This study proposes a novel prediction approach for human breast and colon cancers using different feature spaces. The proposed scheme consists of two stages: the preprocessor and the predictor. In the preprocessor stage, the mega-trend diffusion (MTD) technique is employed to increase the number of minority-class samples, thereby balancing the dataset. In the predictor stage, the machine-learning approaches of K-nearest neighbor (KNN) and support vector machines (SVM) are used to develop hybrid MTD-SVM and MTD-KNN prediction models. The MTD-SVM model provided the best values of accuracy, G-mean and Matthews correlation coefficient of 96.71%, 96.70% and 71.98% for the cancer/non-cancer dataset, the breast/non-breast cancer dataset and the colon/non-colon cancer dataset, respectively. We found that the hybrid MTD-SVM is the best with respect to both prediction performance and computational cost. The MTD-KNN model achieved moderately better prediction than the hybrid MTD-NB (Naive Bayes) model, but at the expense of higher computing cost. The MTD-KNN model is faster than MTD-RF (random forest), but its prediction is not better than that of MTD-RF. To the best of our knowledge, the reported results are the best so far for these datasets. The results indicate that the developed models can be used as a tool for the prediction of cancer. The scheme may also be useful for the study of any sequential information, such as protein sequences or nucleic acid sequences.
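The abstract reports accuracy, G-mean and the Matthews correlation coefficient (MCC) without defining them. The following is a minimal sketch of the standard textbook definitions for a binary problem, not the authors' code; the function names and the toy labels are assumptions made for illustration. It shows why accuracy alone can mislead on an imbalanced dataset, which is the motivation for balancing the classes first.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical imbalanced toy labels: 10 positives, 90 negatives.
y_true = [1] * 10 + [0] * 90
# A degenerate classifier that always predicts the majority class.
y_majority = [0] * 100

counts = confusion_counts(y_true, y_majority)
print(accuracy(*counts), g_mean(*counts), mcc(*counts))
```

On these toy labels the majority-class predictor scores high accuracy while G-mean and MCC both fall to zero, because it never identifies a minority-class sample. G-mean and MCC only rise when both classes are predicted well.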
