A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text

Ying Xiong; Zhongmin Wang; Dehuan Jiang; Xiaolong Wang; Qingcai Chen; Hua Xu; Jun Yan; Buzhou Tang

Abstract

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks in Chinese text processing, and they usually serve as preliminary steps for many Chinese natural language processing (NLP) tasks. Although CWS and POS tagging have been studied extensively in various domains, few studies address them in the clinical domain, because the granularity of words there is not easy to determine. In this paper, we investigated CWS and POS tagging for Chinese clinical text at a fine-grained level and manually annotated a corpus. On this corpus, we compared two state-of-the-art methods: conditional random fields (CRF) and bidirectional long short-term memory with a CRF layer (BiLSTM-CRF). To validate the plausibility of the fine-grained annotation, we further investigated the effect of CWS and POS tagging on Chinese clinical named entity recognition (NER) on another independent corpus. When only CWS was considered, CRF achieved higher precision, recall and F-measure than BiLSTM-CRF. When both CWS and POS tagging were considered, CRF also held an advantage over BiLSTM-CRF. CRF outperformed BiLSTM-CRF by 0.14% in F-measure on CWS and by 0.34% in F-measure on POS tagging. On the NER task, CWS information brought an improvement of up to 0.34% in F-measure, while combined CWS and POS information brought an improvement of up to 0.74% in F-measure. Our proposed fine-grained CWS and POS tagging corpus is reliable and meaningful, as the output of the CWS and POS tagging systems developed on this corpus improved the performance of a Chinese clinical NER system on another independent corpus.
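The abstract reports precision, recall and F-measure for CWS but does not describe the scoring protocol. The sketch below shows one common way such scores are computed, assuming word-level matching on character offsets: a predicted word counts as correct only if both of its boundaries agree with the gold segmentation. The function names `to_spans` and `prf` and the example sentence are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F-measure over parallel lists of sentences.

    Assumption: each predicted sentence covers the same characters as its gold
    counterpart, so words can be compared purely by their offsets.
    """
    n_gold = n_pred = n_correct = 0
    for gold_words, pred_words in zip(gold, pred):
        gold_spans = to_spans(gold_words)
        pred_spans = to_spans(pred_words)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
        n_correct += len(gold_spans & pred_spans)  # words with both boundaries right
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical example: the gold standard splits the sentence more finely.
    gold = [["患者", "无", "发热"]]
    pred = [["患者", "无发热"]]
    print(prf(gold, pred))
```

In this example only 患者 matches, so precision is 1/2 and recall is 1/3. Joint CWS and POS scoring could follow the same pattern by adding the tag to each span tuple, so that a word counts as correct only when both its boundaries and its tag match.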

Bibliographic details

Source: BMC Medical Informatics and Decision Making, 2019, Issue 2, 6 pages
Authors: Ying Xiong; Zhongmin Wang; Dehuan Jiang; Xiaolong Wang; Qingcai Chen; Hua Xu; Jun Yan; Buzhou Tang
Original format: PDF
Chinese Library Classification: Medicine and health
Keywords: Fine-grained Chinese word segmentation; Part-of-speech tagging; Clinical named entity recognition
