International Journal of Population Data Science

Using Biomedical Text as Data and Representation Learning for Identifying Patients with an Osteoarthritis Phenotype in the Electronic Medical Record


Abstract

Introduction
Electronic medical records (EMRs) are increasingly used in health services research. Accurate and efficient identification of a target population with a specific disease phenotype is a necessary precursor to studying the health of these individuals.

Objectives and Approach
We explored the use of biomedical text as input to supervised phenotype identification algorithms. We employed a two-stage classification approach. First, we mapped the discrete, sparse, high-dimensional biomedical text data to a dense, low-dimensional vector space using methods from unsupervised machine learning. Next, we used these learned vectors as inputs to supervised machine learning algorithms for phenotype identification. We demonstrated the applicability of the approach by identifying patients with an osteoarthritis (OA) phenotype using primary care data from the Electronic Medical Record Administrative data Linked Database (EMRALD) held at ICES.

Results
EMRALD contains approximately 20Gb of biomedical text data on approximately 500,000 patients. The unit of analysis for this study is the patient. We were interested in identifying OA patients using solely text data as features. Labelled outcome information was available for a random sample of 7,500 patients. We divided these patients into training (N=6,000), validation (N=750) and test (N=750) cohorts. We learned low-dimensional representations of the input text data on the entire EMRALD corpus (N=500,000). We used the learned numeric vectors as inputs to supervised machine learning models for OA classification (N=6,000 training set patients). We compared models in terms of accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). The best learned models achieved approximately 90% sensitivity and 80% specificity. Classification accuracy varied as a function of the learned inputs.

Conclusion/Implications
We developed an approach to phenotype identification using solely biomedical text as input. Preliminary results suggest our two-stage ML approach has improved operating characteristics compared to existing clinically derived decision rules for OA classification. Future work will explore the generalizability of this methodology to other disease phenotypes.
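
The cohort split and the comparison metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `split_cohorts` and `diagnostic_metrics`, the fixed random seed, and the assumption that the 6,000/750/750 division was made by random shuffling are all hypothetical, and the labels in the usage example are toy values rather than study data.

```python
import random


def split_cohorts(patient_ids, n_train=6000, n_valid=750, n_test=750, seed=0):
    """Randomly partition labelled patients into train/validation/test cohorts.

    Assumption: a simple shuffle-and-slice split; the abstract only reports
    the cohort sizes, not how patients were assigned.
    """
    ids = list(patient_ids)
    if len(ids) != n_train + n_valid + n_test:
        raise ValueError("cohort sizes must sum to the number of labelled patients")
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test


def diagnostic_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, PPV and NPV for binary labels (1 = OA)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # share of true OA cases detected
        "specificity": ratio(tn, tn + fp),  # share of non-OA cases cleared
        "ppv": ratio(tp, tp + fp),          # share of predicted OA that is OA
        "npv": ratio(tn, tn + fn),          # share of predicted non-OA that is non-OA
    }


# Usage with toy values (not study data).
train, valid, test = split_cohorts(range(7500))
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = diagnostic_metrics(toy_true, toy_pred)
```

Reporting sensitivity and specificity alongside PPV and NPV matters here because the predictive values depend on how common OA is in the cohort, whereas sensitivity and specificity do not.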
