Journal: Language Resources and Evaluation

A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annotated Text corpus (MERLOT)

Abstract

High-quality annotated resources are essential for Natural Language Processing. This work presents a corpus of French clinical narratives annotated with linguistic, semantic and structural information and intended for clinical information extraction. Six annotators contributed to the corpus annotation, using a comprehensive annotation scheme covering 21 entities, 11 attributes and 37 relations. All annotators trained on a small, common portion of the corpus before proceeding independently. An automatic tool was used to produce entity and attribute pre-annotations. About a tenth of the corpus was doubly annotated, and annotation differences were resolved in consensus meetings. To ensure annotation consistency throughout the corpus, we devised harmonization tools that automatically identify annotation differences to be resolved. The annotation project spanned 24 months and resulted in a corpus of 500 documents (148,476 tokens) annotated with 44,740 entities and 26,478 relations. The average inter-annotator agreement is 0.793 F-measure for entities and 0.789 for relations. The pre-annotation tool for entities reached 0.814 F-measure when sufficient training data was available, which shows the value of the corpus for building and evaluating information extraction methods. In addition, the harmonization methods we introduced further improved the quality of the annotations in the corpus.
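The abstract reports inter-annotator agreement as an F-measure but does not spell out the matching criterion. As a minimal sketch, assuming exact-match comparison of entities represented as (start, end, type) triples, agreement between two annotators could be computed as below. The function name, the tuple layout and the entity type labels are hypothetical illustrations, not taken from the MERLOT scheme.

```python
from typing import Set, Tuple

# Assumed representation: (start offset, end offset, entity type).
Entity = Tuple[int, int, str]


def agreement_f_measure(annotator_a: Set[Entity], annotator_b: Set[Entity]) -> float:
    """F-measure between two annotation sets, treating A as the reference.

    An entity counts as a match only if span and type are identical
    (an assumption; relaxed span matching is another common choice).
    """
    if not annotator_a and not annotator_b:
        return 1.0  # two empty annotations agree trivially
    matches = len(annotator_a & annotator_b)
    if matches == 0:
        return 0.0
    precision = matches / len(annotator_b)
    recall = matches / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Toy data with hypothetical type labels.
a = {(0, 8, "Disorder"), (15, 22, "Drug"), (30, 35, "Anatomy")}
b = {(0, 8, "Disorder"), (15, 22, "Procedure"), (30, 35, "Anatomy"), (40, 44, "Dose")}

score = agreement_f_measure(a, b)
print(round(score, 3))  # prints 0.571
```

Because the balanced F-measure equals 2 × matches / (|A| + |B|), the result does not depend on which annotator is treated as the reference, which makes it convenient for pairwise agreement when no gold standard exists yet.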