Knowledge-Based Systems

Combining contextualized word representation and sub-document level analysis through Bi-LSTM+CRF architecture for clinical de-identification


Abstract

Clinical de-identification aims to identify Protected Health Information in clinical data, enabling data sharing and publication. The first automatic de-identification systems were based on rules or on machine learning methods, and were limited by language changes, a lack of context awareness, and time-consuming feature engineering. Newer deep learning techniques for sequence labeling have shown better results, with reduced feature engineering effort and the use of word representation techniques in vector space. However, they are not able to jointly represent the polysemic and context-dependent nature of words, as well as the morpho-syntactic mutations characteristic of handwriting. To address these limitations, a new de-identification approach based on deep learning techniques for Named Entity Recognition has been proposed, whose key factors are: (i) a Bidirectional Long Short-Term Memory + Conditional Random Field architecture for sequence labeling that takes advantage of the widest possible representation context; (ii) a contextualized language model, working at character level, to capture the polysemy of words and manage the morpho-syntactic variations typical of handwritten notes; (iii) multiple stacked word representations to better capture latent syntactic and semantic similarities. This approach has been tested on the official Informatics for Integrating Biology & the Bedside 2014 de-identification dataset, showing performance similar to or higher than the state of the art with respect to category and binary recognition, but without any feature engineering or handcrafted rules. The experiments demonstrate the effectiveness of the proposed approach, in particular with regard to category-level recognition, which is essential to correctly replace entities with surrogates for anonymization purposes. © 2020 Elsevier B.V. All rights reserved.
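The distinction between binary and category recognition can be made concrete with a small sketch. This is only a simplified token-level illustration, not the official evaluation procedure of the dataset or the authors' code. The tag names, the example sentence, the `SURROGATES` table, and the helper functions are all hypothetical. It shows why a tagger that finds every PHI token can still score lower at category level, and why the category matters when choosing a surrogate.

```python
from typing import Dict, List

# Hypothetical surrogate placeholders, one per PHI category (illustrative only).
SURROGATES: Dict[str, str] = {
    "NAME": "[NAME]",
    "DATE": "[DATE]",
    "LOCATION": "[LOCATION]",
}


def to_binary(tag: str) -> str:
    """Collapse every PHI category into a single 'PHI' label; keep 'O' as is."""
    return "O" if tag == "O" else "PHI"


def token_f1(gold: List[str], pred: List[str]) -> float:
    """Token-level F1 over non-'O' labels; a hit requires identical labels."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g != "O" and g == p)
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def replace_with_surrogates(tokens: List[str], tags: List[str]) -> List[str]:
    """Replace each PHI token with the placeholder for its predicted category."""
    return [
        tok if tag == "O" else SURROGATES.get(tag, "[PHI]")
        for tok, tag in zip(tokens, tags)
    ]


# Made-up example: the last token is detected as PHI but with the wrong category.
tokens = ["Seen", "by", "Dr", "Smith", "on", "March", "3"]
gold = ["O", "O", "O", "NAME", "O", "DATE", "DATE"]
pred = ["O", "O", "O", "NAME", "O", "DATE", "LOCATION"]

category_f1 = token_f1(gold, pred)
binary_f1 = token_f1([to_binary(t) for t in gold], [to_binary(t) for t in pred])
anonymized = replace_with_surrogates(tokens, pred)

print(f"category F1 = {category_f1:.3f}, binary F1 = {binary_f1:.3f}")
print(" ".join(anonymized))
```

In this toy case the binary score is perfect because every PHI token was found, while the category score drops because one date token was labeled as a location. The same mistake then puts a location surrogate where a date surrogate belongs.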
