Journal: PLoS Medicine
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study

Abstract

Background: There is interest in using convolutional neural networks (CNNs) to analyze medical imaging to provide computer-aided diagnosis (CAD). Recent work has suggested that image classification CNNs may not generalize to new data as well as previously believed. We assessed how well CNNs generalized across three hospital systems for a simulated pneumonia screening task.

Methods and findings: A cross-sectional design with multiple model training cohorts was used to evaluate model generalizability to external sites using split-sample validation. A total of 158,323 chest radiographs were drawn from three institutions: National Institutes of Health Clinical Center (NIH; 112,120 from 30,805 patients), Mount Sinai Hospital (MSH; 42,396 from 12,904 patients), and Indiana University Network for Patient Care (IU; 3,807 from 3,683 patients). These patient populations had an age mean (SD) of 46.9 years (16.6), 63.2 years (16.5), and 49.6 years (17) with a female percentage of 43.5%, 44.8%, and 57.3%, respectively. We assessed individual models using the area under the receiver operating characteristic curve (AUC) for radiographic findings consistent with pneumonia and compared performance on different test sets with DeLong’s test. The prevalence of pneumonia was high enough at MSH (34.2%) relative to NIH and IU (1.2% and 1.0%) that merely sorting by hospital system achieved an AUC of 0.861 (95% CI 0.855–0.866) on the joint MSH–NIH dataset. Models trained on data from either NIH or MSH had equivalent performance on IU (P values 0.580 and 0.273, respectively) and inferior performance on data from each other relative to an internal test set (i.e., new data from within the hospital system used for training data; P values both P = 0.001).

To test the effect of pooling data from sites with disparate pneumonia prevalence, we used stratified subsampling to generate MSH–NIH cohorts that only differed in disease prevalence between training data sites. When both training data sites had the same pneumonia prevalence, the model performed consistently on external IU data (P = 0.88). When a 10-fold difference in pneumonia rate was introduced between sites, internal test performance improved compared to the balanced model (10× MSH risk P P = 0.002), but this outperformance failed to generalize to IU (MSH 10× P P = 0.027). CNNs were able to directly detect hospital system of a radiograph for 99.95% NIH (22,050/22,062) and 99.98% MSH (8,386/8,388) radiographs. The primary limitation of our approach and the available public data is that we cannot fully assess what other factors might be contributing to hospital system–specific biases.

Conclusion: Pneumonia-screening CNNs achieved better internal than external performance in 3 out of 5 natural comparisons. When models were trained on pooled data from sites with different pneumonia prevalence, they performed better on new pooled data from these sites but not on external data. CNNs robustly identified hospital system and department within a hospital, which can have large differences in disease burden and may confound predictions.
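The claim that sorting by hospital system alone yields a high AUC can be checked with back-of-the-envelope arithmetic. The sketch below is not the authors' code: the function name `site_only_auc` is hypothetical, and it assumes the full MSH and NIH image counts and the rounded prevalences quoted in the abstract, which may differ from the exact joint dataset the study evaluated. It scores every radiograph by its site only and computes AUC as the probability that a random positive outranks a random negative, counting ties as one half.

```python
def site_only_auc(n_high, prev_high, n_low, prev_low):
    """AUC of a 'classifier' that ranks images purely by site.

    Every image from the high-prevalence site gets a higher score than
    every image from the low-prevalence site; images within a site tie.
    """
    pos_high = round(n_high * prev_high)
    neg_high = n_high - pos_high
    pos_low = round(n_low * prev_low)
    neg_low = n_low - pos_low

    # Positive/negative pairs where the positive is ranked strictly higher:
    # a positive from the high-prevalence site vs. a negative from the other.
    wins = pos_high * neg_low
    # Pairs drawn from the same site are tied and count as half a win.
    ties = pos_high * neg_high + pos_low * neg_low
    # A low-site positive vs. a high-site negative is a loss (contributes 0).
    total_pairs = (pos_high + pos_low) * (neg_high + neg_low)
    return (wins + 0.5 * ties) / total_pairs


# Assumed inputs: image counts and prevalences as quoted in the abstract.
auc = site_only_auc(n_high=42_396, prev_high=0.342, n_low=112_120, prev_low=0.012)
print(f"site-only AUC: {auc:.3f}")
```

With these rounded inputs the result comes to about 0.86, in line with the reported 0.861, which illustrates how a model could score well on pooled data simply by recognizing where an image came from.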
