Journal: Computers in Biology and Medicine

Missing data imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values

Abstract

Breast cancer is the most frequently diagnosed cancer in women. Using historical patient information stored in clinical datasets, data mining and machine learning approaches can be applied to predict the survival of breast cancer patients. A common drawback is the absence of information, i.e., missing data, in certain clinical trials. Most standard prediction methods cannot handle incomplete samples, so missing data imputation is widely applied to work around this limitation. Given that each breast cancer dataset has its own characteristics, a detailed analysis is required to determine the most appropriate imputation and prediction methods in each clinical environment. This research work analyzes a real breast cancer dataset from the Portuguese Institute of Oncology of Porto with a high percentage of unknown categorical information (most clinical data of the patients are incomplete), which is a challenge in terms of complexity. Four scenarios are evaluated: (I) 5-year survival prediction without imputation, and 5-year survival prediction from a cleaned dataset with (II) Mode imputation, (III) Expectation-Maximization imputation and (IV) K-Nearest Neighbors imputation. Prediction models for breast cancer survivability are constructed using four different methods: K-Nearest Neighbors, Classification Trees, Logistic Regression and Support Vector Machines. Experiments are performed in a nested ten-fold cross-validation procedure. According to the obtained results, the K-Nearest Neighbors algorithm performs best, with an accuracy above 81% and an area under the Receiver Operating Characteristic curve above 0.78, which are very good results in this complex scenario. (C) 2015 Elsevier Ltd. All rights reserved.
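
To make two of the imputation scenarios named in the abstract concrete, the following is a minimal sketch of Mode imputation and K-Nearest Neighbors imputation for categorical records. It is not the authors' implementation. The `None` missing-value marker, the Hamming-style distance over features observed in both records, the majority vote among neighbors, and the toy records (invented here, not taken from the Porto dataset) are all assumptions made for illustration. Expectation-Maximization imputation is left out because it does not fit in a short example.

```python
from collections import Counter

MISSING = None  # assumed marker for an unknown categorical value


def mode_impute(rows):
    """Replace each missing value with the most frequent observed value of its column."""
    modes = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        # A column with no observed values at all is left as MISSING.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else MISSING)
    return [[modes[j] if v is MISSING else v for j, v in enumerate(r)] for r in rows]


def hamming(a, b):
    """Mismatch fraction over features observed in both records (1.0 if none are shared)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return 1.0
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(rows, k=3):
    """Fill each missing value by majority vote among the k nearest rows that observe it."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            # Candidate donors: every other row in which feature j is observed.
            donors = [r for m, r in enumerate(rows) if m != i and r[j] is not MISSING]
            donors.sort(key=lambda r: hamming(row, r))
            votes = Counter(r[j] for r in donors[:k])
            if votes:
                filled[i][j] = votes.most_common(1)[0][0]
    return filled


# Hypothetical toy records: (receptor status, stage, treatment flag).
rows = [
    ["pos", "II", "yes"],
    ["pos", "II", None],
    ["neg", "III", "no"],
    ["neg", None, "no"],
    ["pos", "II", "yes"],
    ["neg", "III", "no"],
]

mode_filled = mode_impute(rows)
knn_filled = knn_impute(rows, k=3)
```

On these toy records the two strategies disagree: Mode imputation fills each gap with the column-wide majority value, while K-Nearest Neighbors imputation fills it with the value carried by the most similar records. That difference is why the choice of imputation method can affect the downstream survival classifier.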
