International Conference on Progress in Informatics and Computing

Lexicon-based semi-CRF for Chinese clinical text word segmentation


Abstract

Word segmentation is in most cases the basis of text analysis and is vital to the accuracy of subsequent natural language processing (NLP) tasks. While word segmentation for general text has been studied intensively and quite a few algorithms have been proposed, these algorithms do not work well in specialized domains such as clinical text analysis. Moreover, most state-of-the-art methods have difficulty identifying out-of-vocabulary (OOV) words. For these two reasons, in this paper we propose a semi-supervised CRF (semi-CRF) algorithm for Chinese clinical text word segmentation. The semi-CRF is implemented by modifying the learning objective to accommodate partially labeled data. Training data are obtained by applying a bidirectional lexicon matching scheme, and a modified Viterbi algorithm that uses the same lexicon matching scheme is proposed for segmenting raw sentences. Experiments show that our model achieves a precision of 93.88% on test data and outperforms two popular open-source Chinese word segmentation tools, HanLP and THULAC. Because it relies on a lexicon, our model can be adapted to word segmentation in other domains.
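The abstract does not spell out the bidirectional lexicon matching scheme, but such schemes are conventionally realized as a forward plus a backward maximum-matching pass over the sentence. Below is a minimal Python sketch under that assumption; the function names, the `max_len` window, and the tie-breaking heuristics (prefer fewer words, then fewer single-character tokens) are illustrative assumptions, not the paper's specification.

```python
# Sketch of bidirectional maximum matching over a character sequence.
# All heuristics here are common conventions, assumed rather than taken
# from the paper.

def forward_max_match(sentence, lexicon, max_len=6):
    """Greedily match the longest lexicon word from the left;
    fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in lexicon or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

def backward_max_match(sentence, lexicon, max_len=6):
    """Greedily match the longest lexicon word from the right."""
    words, j = [], len(sentence)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if sentence[i:j] in lexicon or i == j - 1:
                words.insert(0, sentence[i:j])
                j = i
                break
    return words

def bidirectional_match(sentence, lexicon):
    """Run both passes and pick one segmentation: fewer words wins;
    on a tie, prefer fewer single-character tokens (assumed heuristic)."""
    fwd = forward_max_match(sentence, lexicon)
    bwd = backward_max_match(sentence, lexicon)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(words):
        return sum(1 for w in words if len(w) == 1)

    return fwd if singles(fwd) <= singles(bwd) else bwd

if __name__ == "__main__":
    lexicon = {"临床", "文本", "分词"}  # toy lexicon for illustration
    print(bidirectional_match("临床文本分词", lexicon))
    # -> ['临床', '文本', '分词']
```

Under one plausible reading of the abstract, characters on which the forward and backward passes agree would receive segmentation labels, while points of disagreement are left unlabeled, yielding exactly the kind of partially labeled training data the modified semi-CRF objective is designed to exploit.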
