Conference: Medical Imaging Conference

Hazards of data leakage in machine learning: A study on classification of breast cancer using deep neural networks



Abstract

With the renewed interest in developing machine learning methods for medical imaging using deep-learning approaches, it is essential to reexamine the risk of data leakage. In this study, we simulated data leakage in the form of feature leakage, in which a classifier was trained on the training set but the feature selection was influenced by performance on the validation set. A pre-trained deep-learning convolutional neural network (DCNN) without fine-tuning was used as a feature extractor for malignant versus benign mass classification in mammography. A feature selection algorithm was run in wrapper mode, with a cost function tuned to follow the performance metric on the validation set. A linear discriminant analysis (LDA) classifier was then trained to classify masses on mammographic patches. Mammograms from 1,882 patient cases with 4,577 unique patches were partitioned by patient into 3,222 patches for training and 508 for validation, while 847 were sequestered as an unseen independent test set to evaluate the generalization error. The effect of finite sample size on data leakage was studied by varying the training and validation set sizes from 10% to 100% of the available sets. The area under the receiver operating characteristic curve (AUC) was used as the performance metric. The results show that performance on the validation set could be substantially overestimated, with AUCs of 0.75 to 0.99 across the range of sample sizes, whereas the independent test performance realistically reached an AUC of only 0.72. The analysis indicates that deep-learning pipelines risk a large inflation in estimated performance, and that proper housekeeping rules should be followed when designing and developing deep-learning methods for medical imaging.
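To make the leakage mechanism concrete, the following is a minimal sketch (not the authors' code) of wrapper-style forward feature selection in which the selection criterion is the AUC on the validation set, followed by an LDA classifier evaluated on a sequestered test set. The data arrays, split sizes, and feature dimensionality are stand-ins filled with random values, so the specific AUC numbers it prints are illustrative only.

```python
# Sketch of feature leakage: wrapper feature selection steered by validation AUC.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-ins for DCNN features pooled from mammographic patches (one row per patch).
n_features = 200
X_train, y_train = rng.normal(size=(3222, n_features)), rng.integers(0, 2, 3222)
X_val,   y_val   = rng.normal(size=(508,  n_features)), rng.integers(0, 2, 508)
X_test,  y_test  = rng.normal(size=(847,  n_features)), rng.integers(0, 2, 847)

def forward_selection_on_validation(max_features=20):
    """Greedy wrapper selection: each step adds the feature that maximizes
    validation AUC. Because the validation labels steer the search, the
    resulting validation AUC is optimistically biased (the leakage)."""
    selected, best_auc = [], 0.0
    for _ in range(max_features):
        best_candidate = None
        for j in range(n_features):
            if j in selected:
                continue
            cols = selected + [j]
            clf = LinearDiscriminantAnalysis().fit(X_train[:, cols], y_train)
            auc = roc_auc_score(y_val, clf.decision_function(X_val[:, cols]))
            if auc > best_auc:
                best_auc, best_candidate = auc, j
        if best_candidate is None:  # no remaining feature improves validation AUC
            break
        selected.append(best_candidate)
    return selected, best_auc

selected, val_auc = forward_selection_on_validation()
clf = LinearDiscriminantAnalysis().fit(X_train[:, selected], y_train)
test_auc = roc_auc_score(y_test, clf.decision_function(X_test[:, selected]))

print(f"validation AUC (leaky): {val_auc:.2f}   independent test AUC: {test_auc:.2f}")
```

Because the validation labels decide which features are kept, the validation AUC drifts upward even on uninformative features, while the sequestered test AUC stays near chance; this mirrors the validation/test gap (0.75-0.99 versus 0.72) reported in the abstract. The remedy is to keep the test set untouched until all feature and model choices are frozen, as in the patient-level partition described above.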
