Journal: Expert Systems with Applications

Unbalanced breast cancer data classification using novel fitness functions in genetic programming


Abstract

Breast cancer is a common disease, and to prevent it the disease must be identified at an early stage. Available breast cancer datasets are unbalanced in nature, i.e., they contain more instances of benign (non-cancerous) cases than malignant (cancerous) ones. Most machine learning (ML) models therefore struggle to separate benign from malignant cases properly, even when they report high accuracy. Accuracy is not a good metric for assessing ML models on breast cancer datasets, because it is biased toward the majority class. To address this issue, we use Genetic Programming (GP) and propose two fitness functions. The first is the F2 score, which focuses on learning more about the minority class, the class that contains more relevant information. The second is a novel fitness function, the Distance score (D score), which learns about both classes by giving them equal importance and remaining unbiased. The GP framework that implements the D score is named D-score GP (DGP), and the framework that implements the F2 score is named F2GP. The proposed F2GP achieved a maximum accuracy of 99.63%, 99.51% and 100% for the 60-40 and 70-30 partition schemes and the 10-fold cross-validation scheme, respectively, and DGP achieved a maximum accuracy of 99.63%, 98.5% and 100% under the same three schemes, respectively. The proposed models also achieve a recall of 100% for all the test cases. This shows that using a new fitness function for unbalanced data classification improves the performance of a classifier. © 2019 Elsevier Ltd. All rights reserved.
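The contrast the abstract draws between accuracy and the F2 score can be illustrated with a minimal Python sketch. It uses the standard F-beta definition with beta = 2, which weights recall more heavily than precision. The function names, the 90/10 toy data and the two hypothetical predictors are assumptions made for illustration; they are not taken from the paper. The D score is not sketched because the abstract does not give its formula.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def fbeta_score(y_true, y_pred, beta=2.0, positive=1):
    """Standard F-beta score; beta=2 gives the F2 score (recall-weighted)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical toy data: 90 benign (0) and 10 malignant (1) cases.
y_true = [0] * 90 + [1] * 10

# Predictor A (assumed): labels every case benign.
always_benign = [0] * 100

# Predictor B (assumed): 5 false alarms on benign cases, misses 1 malignant case.
catches_most = [0] * 85 + [1] * 5 + [1] * 9 + [0] * 1

for name, pred in [("always_benign", always_benign), ("catches_most", catches_most)]:
    print(f"{name}: accuracy={accuracy(y_true, pred):.2f}, "
          f"F2={fbeta_score(y_true, pred):.2f}")
```

On this toy data the always-benign predictor reaches 90% accuracy while its F2 score is 0, because it never detects a malignant case. A function like `fbeta_score` could plausibly serve as the fitness measure for a GP individual's predictions, though how F2GP wires it into the evolutionary loop is not described in the abstract.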
