International Conference on Intelligent Computing and Control Systems

A Comparative Study for Breast Cancer Prediction using Machine Learning and Feature Selection

Abstract

Many factors may contribute to the occurrence of breast cancer. Although it is very difficult to pinpoint the exact environmental and other factors responsible, identifying them remains significant for determining whether cancer will occur. Using machine learning techniques and routine diagnostic information, we can assess the risk of breast cancer occurrence. Cancer data sets contain many attributes of patient information, but not every feature is relevant to predicting cancer. Feature selection techniques are useful in such scenarios because they retain only the relevant feature set. In this paper we present a comparative study of the effect of feature selection techniques on the accuracy of existing machine learning algorithms. We consider three machine learning algorithms: Logistic Regression, Naive Bayes and Random Forest. We consider four feature selection techniques: Sequential Forward Feature Selection, Recursive Feature Elimination, f-test and correlation. The publicly available Breast Cancer Wisconsin (Diagnostic) Data Sets from the UCI Repository are used in this work. The results show that the Random Forest algorithm gives the highest accuracy when feature selection is applied. Furthermore, the f-test gives better results on the smaller data set, and Sequential Forward Selection gives better results on the larger data set.
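As an illustration of the filter-style selection mentioned in the abstract, the following is a minimal sketch of f-test feature ranking, assuming the f-test refers to the one-way ANOVA F-statistic computed per feature across class labels. The feature names, the toy values and the helpers `f_statistic` and `top_k` are hypothetical and are not taken from the paper or from the Wisconsin data.

```python
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of one feature across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2:
        raise ValueError("need at least two classes")
    grand = mean(values)
    # Variance of the class means around the grand mean (between-group).
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    # Variance of the samples around their own class mean (within-group).
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    return between / within if within > 0 else float("inf")


def top_k(features, labels, k):
    """Return the names of the k features with the largest F-statistic."""
    ranked = sorted(features, key=lambda name: f_statistic(features[name], labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy data: 0 = benign, 1 = malignant (illustrative values only).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "radius":  [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],  # well separated
    "texture": [2.0, 2.5, 1.8, 2.2, 2.6, 3.0, 2.4, 2.9],  # mildly separated
    "noise":   [5.0, 1.0, 4.0, 2.0, 1.0, 5.0, 2.0, 4.0],  # same mean per class
}

selected = top_k(features, labels, k=2)
print(selected)
```

The retained features would then be passed to a classifier such as Logistic Regression, Naive Bayes or Random Forest. In scikit-learn, the same statistic is available as `sklearn.feature_selection.f_classif`, typically used through `SelectKBest`.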
