Development and Use of Natural Language Processing for Identification of Distant Cancer Recurrence and Sites of Distant Recurrence Using Unstructured Electronic Health Record Data

Yasmin H. Karimi; Douglas W. Blayney; Allison W. Kurian; Jeanne Shen; Rikiya Yamashita; Daniel Rubin

Abstract

PURPOSE Large-scale analysis of real-world evidence is often limited to structured data fields that do not contain reliable information on recurrence status and disease sites. In this report, we describe a natural language processing (NLP) framework that uses data from free-text, unstructured reports to classify recurrence status and sites of recurrence for patients with breast cancer and hepatocellular carcinoma (HCC).

METHODS Using two cohorts of breast cancer and HCC cases, we validated the ability of a previously developed NLP model to distinguish between no recurrence, local recurrence, and distant recurrence, based on clinician notes, radiology reports, and pathology reports, compared with manual curation. A second NLP model was trained and validated to identify sites of recurrence. We evaluated the ability of each NLP model to identify the presence, timing, and site of recurrence against manual chart review and International Classification of Diseases coding.

RESULTS A total of 1,273 patients were included in the development and validation of the two models. The NLP model for recurrence detected distant recurrence with an area under the curve of 0.98 (95% CI, 0.96 to 0.99) and 0.95 (95% CI, 0.88 to 0.98) in the breast cancer and HCC cohorts, respectively. The mean accuracy of the NLP model for detecting any site of distant recurrence was 0.9 for breast cancer and 0.83 for HCC. The NLP model for recurrence identified a larger proportion of patients with distant recurrence in a breast cancer database (11.1%) compared with International Classification of Diseases coding (2.31%).

CONCLUSION We developed two NLP models to identify distant cancer recurrence, timing of recurrence, and sites of recurrence based on unstructured electronic health record data. These models can be used to perform large-scale retrospective studies in oncology.
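
The results above are reported as an area under the curve with a 95% confidence interval, measured against manual chart review. The abstract does not say how the interval was obtained, so the following is only a minimal sketch of one common approach: a rank-based AUC with a percentile bootstrap. The function names `auc` and `bootstrap_ci`, the number of resamples, and the toy labels and scores are all assumptions made for illustration and are not taken from the study.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: 1 = distant recurrence per manual chart review, 0 = otherwise.
    scores: model-predicted probability of distant recurrence.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacked one class; AUC undefined, so redraw
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy, made-up data for illustration only (not from the study).
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.70, 0.80, 0.15, 0.95]
point = auc(labels, scores)
low, high = bootstrap_ci(labels, scores)
print(f"AUC = {point:.2f} (95% CI, {low:.2f} to {high:.2f})")
```

With a cohort this small the interval is wide and unstable; the sketch is meant only to show the mechanics of comparing model scores with manually curated labels.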

