Generating Synthetic Training Data for Supervised De-Identification of Electronic Health Records

Abstract

A major hurdle in the development of natural language processing (NLP) methods for Electronic Health Records (EHRs) is the lack of large, annotated datasets. Privacy concerns prevent the distribution of EHRs, and the annotation of data is known to be costly and cumbersome. Synthetic data presents a promising solution to the privacy concern, if synthetic data has comparable utility to real data and if it preserves the privacy of patients. However, the generation of synthetic text alone is not useful for NLP because of the lack of annotations. In this work, we propose the use of neural language models (LSTM and GPT-2) for generating artificial EHR text jointly with annotations for named-entity recognition. Our experiments show that artificial documents can be used to train a supervised named-entity recognition model for de-identification, which outperforms a state-of-the-art rule-based baseline. Moreover, we show that combining real data with synthetic data improves the recall of the method, without manual annotation effort. We conduct a user study to gain insights on the privacy of artificial text. We highlight privacy risks associated with language models to inform future research on privacy-preserving automated text generation and metrics for evaluating privacy-preservation during text generation.
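
To make the idea of generating text "jointly with annotations" concrete, the following is a minimal sketch of how inline entity markers in a generated document could be turned into token-level labels for training a named-entity recognition model. The markup format (`<NAME>...</NAME>`), the tag names, the sample sentence, and the function `inline_to_bio` are all assumptions made for illustration; the abstract does not specify how the annotations are encoded.

```python
import re

# Hypothetical inline markup: <TAG>surface text</TAG>. The tag names and the
# sample sentence are invented for illustration, not taken from the paper.
TAG_PATTERN = re.compile(r"<(?P<tag>[A-Z]+)>(?P<text>.*?)</(?P=tag)>")


def inline_to_bio(annotated):
    """Convert inline-tagged text to (token, BIO label) pairs.

    Tokens are split on whitespace, which is a simplification.
    """
    pairs = []
    cursor = 0
    for match in TAG_PATTERN.finditer(annotated):
        # Text before the tagged span lies outside any entity.
        pairs.extend((tok, "O") for tok in annotated[cursor:match.start()].split())
        # The first token of a span gets B-, the remaining tokens get I-.
        for i, tok in enumerate(match.group("text").split()):
            prefix = "B" if i == 0 else "I"
            pairs.append((tok, f"{prefix}-{match.group('tag')}"))
        cursor = match.end()
    # Trailing text after the last tagged span.
    pairs.extend((tok, "O") for tok in annotated[cursor:].split())
    return pairs


sample = "Patient <NAME>Jan Jansen</NAME> was seen on <DATE>12 May</DATE> ."
for token, label in inline_to_bio(sample):
    print(token, label)
```

Under this assumed encoding, a language model that emits the markers as part of its output produces labelled training examples without a separate annotation step, which is the property the abstract relies on.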
