Journal: BMC Medical Informatics and Decision Making

A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text



Abstract

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks in Chinese text processing, and they are usually preliminary steps for many Chinese natural language processing (NLP) tasks. Although CWS and POS tagging have been studied extensively in various domains, few studies address them in the clinical domain, because the granularity of words there is not easy to determine. In this paper, we investigated CWS and POS tagging for Chinese clinical text at a fine-grained level and manually annotated a corpus. On this corpus, we compared two state-of-the-art methods: conditional random fields (CRF) and bidirectional long short-term memory (BiLSTM) with a CRF layer (BiLSTM-CRF). To validate the plausibility of the fine-grained annotation, we further investigated the effect of CWS and POS tagging on Chinese clinical named entity recognition (NER) using another, independent corpus. When only CWS was considered, CRF achieved higher precision, recall and F-measure than BiLSTM-CRF. When both CWS and POS tagging were considered, CRF also held an advantage over BiLSTM-CRF. CRF outperformed BiLSTM-CRF by 0.14% in F-measure on CWS and by 0.34% in F-measure on POS tagging. For NER, adding CWS information yielded a maximum improvement of 0.34% in F-measure, while adding both CWS and POS information yielded a maximum improvement of 0.74% in F-measure. Our proposed fine-grained CWS and POS tagging corpus is reliable and meaningful, as the output of the CWS and POS tagging systems developed on it improved the performance of a Chinese clinical NER system on another, independent corpus.
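
The abstract reports precision, recall and F-measure for CWS but does not say how they were computed. As a minimal sketch, assuming the common word-level convention in which a predicted word counts as correct only when both of its boundaries match a gold word, the metrics could be calculated as follows. The function names `to_spans` and `segmentation_prf` and the example sentence are hypothetical illustrations, not taken from the paper or its corpus.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F-measure for one segmented sentence.

    Assumption: a predicted word is correct only if its exact character
    span also appears in the gold segmentation.
    """
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and predicted segmentations must cover the same text")
    gold_spans = to_spans(gold)
    pred_spans = to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Illustrative sentence (not from the corpus): the prediction merges two gold words.
gold_words = ["患者", "无", "发热"]
pred_words = ["患者", "无发热"]
p, r, f = segmentation_prf(gold_words, pred_words)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

In this example only one of the two predicted words matches a gold span, which shows how segmentation granularity directly affects the scores. A corpus-level score would typically sum the correct, predicted and gold counts over all sentences before dividing, which this single-sentence sketch does not do.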
