Source: Linguistic Annotation Workshop (conference proceedings)

Annotating a Large Representative Corpus of Clinical Notes for Parts of Speech

Abstract

We report on the procedures for developing a large representative corpus of 50,000 sentences taken from clinical notes. Previously reported annotated corpora of clinical notes have been small and do not represent the whole domain of clinical notes. The sentences in this corpus were selected from a very large raw corpus of ten thousand documents, which were sampled from an internal repository of more than 700,000 documents drawn from multiple health care providers. Each document is de-identified to remove any protected health information (PHI). Using the Penn Treebank tagging guidelines with minor modifications, we annotate this corpus manually, with an average inter-annotator agreement of more than 98%. The goal is to create a part-of-speech annotated corpus in the clinical domain that is comparable to the Penn Treebank and that represents the full range of contemporary text used in the clinical domain. We also report the output of the TnT tagger trained on the initial 21,000 annotated sentences, which reaches a preliminary accuracy of above 96%.
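
The abstract quotes two token-level figures: inter-annotator agreement and tagger accuracy. It does not say how either was computed, so the following is only a minimal sketch that assumes simple percent agreement over aligned tokens. The function name `token_agreement` and the example tag sequences are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def token_agreement(tags_a: Sequence[str], tags_b: Sequence[str]) -> float:
    """Return the fraction of tokens on which two tag sequences agree.

    Assumption: both sequences label the same tokens in the same order.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must cover the same tokens")
    if not tags_a:
        raise ValueError("cannot score an empty sequence")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Hypothetical Penn Treebank tags assigned by two annotators to one
# five-token sentence; they disagree on the fourth token only.
annotator_1 = ["NN", "VBZ", "JJ", "NN", "."]
annotator_2 = ["NN", "VBZ", "JJ", "NNS", "."]

agreement = token_agreement(annotator_1, annotator_2)
print(f"{agreement:.2f}")  # 0.80
```

Under the same assumption, comparing a tagger's output against the gold annotation with this function would give token-level accuracy. Chance-corrected measures such as Cohen's kappa are a common alternative to raw agreement, but the abstract does not indicate which measure the authors used.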