
Lexicon-based semi-CRF for Chinese clinical text word segmentation

International Conference on Progress in Informatics and Computing

Abstract

Word segmentation is in most cases the basis of text analysis and is vital to the accuracy of subsequent natural language processing (NLP) tasks. While word segmentation for general text has been studied intensively and quite a few algorithms have been proposed, these algorithms do not work well in specialized fields such as clinical text analysis. Moreover, most state-of-the-art methods have difficulty identifying out-of-vocabulary (OOV) words. For these two reasons, we propose in this paper a semi-supervised CRF (semi-CRF) algorithm for Chinese clinical text word segmentation. The semi-CRF is implemented by modifying the learning objective so that the model can be trained on partially labeled data. Training data are obtained by applying a bidirectional lexicon matching scheme, and a modified Viterbi algorithm that uses the same lexicon matching scheme is proposed for segmenting raw sentences. Experiments show that our model achieves a precision of 93.88% on the test data and outperforms two popular open-source Chinese word segmentation tools, HanLP and THULAC. By using a suitable lexicon, our model can be adapted to word segmentation in other domains.
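
The abstract states only that the learning objective is modified to accommodate partially labeled data. A standard way to train a CRF from partial annotations, offered here as an assumption about what that modification looks like rather than as the paper's exact objective, is to maximize the marginal log-likelihood of all label sequences consistent with the partial labels:

    \mathcal{L}(\theta)
      = \sum_{i} \log \sum_{\mathbf{y} \in Y_P(\mathbf{x}^{(i)})} p_\theta\bigl(\mathbf{y} \mid \mathbf{x}^{(i)}\bigr)
      = \sum_{i} \Bigl( \log Z_P\bigl(\mathbf{x}^{(i)}\bigr) - \log Z\bigl(\mathbf{x}^{(i)}\bigr) \Bigr)

where Y_P(x^{(i)}) is the set of complete label sequences that agree with the partial (lexicon-derived) labels of sentence x^{(i)}, Z_P is the partition function restricted to that set, and Z is the ordinary partition function. Both sums can be computed with the forward algorithm over the constrained and unconstrained label lattices, so the gradient stays tractable.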
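
The bidirectional lexicon matching scheme used to produce the training data is likewise only named in the abstract. The sketch below is a minimal Python implementation of classic bidirectional maximum matching (forward plus backward maximum matching with common disambiguation heuristics), which is one plausible reading of the scheme; the function names, tie-breaking rules, and toy clinical lexicon are assumptions, not the paper's code.

    # Sketch under the assumptions described above; not the paper's implementation.

    def forward_mm(sentence, lexicon, max_len=6):
        """Forward maximum matching: greedily take the longest lexicon
        word starting at the current position (single char as fallback)."""
        words, i = [], 0
        while i < len(sentence):
            for l in range(min(max_len, len(sentence) - i), 0, -1):
                cand = sentence[i:i + l]
                if l == 1 or cand in lexicon:
                    words.append(cand)
                    i += l
                    break
        return words

    def backward_mm(sentence, lexicon, max_len=6):
        """Backward maximum matching: same idea, scanning right to left."""
        words, j = [], len(sentence)
        while j > 0:
            for l in range(min(max_len, j), 0, -1):
                cand = sentence[j - l:j]
                if l == 1 or cand in lexicon:
                    words.append(cand)
                    j -= l
                    break
        return list(reversed(words))

    def bidirectional_mm(sentence, lexicon, max_len=6):
        """Run both directions; prefer the segmentation with fewer words,
        breaking ties by fewer single-character words (common heuristics)."""
        fwd = forward_mm(sentence, lexicon, max_len)
        bwd = backward_mm(sentence, lexicon, max_len)
        if len(fwd) != len(bwd):
            return fwd if len(fwd) < len(bwd) else bwd
        singles = lambda ws: sum(1 for w in ws if len(w) == 1)
        return bwd if singles(bwd) < singles(fwd) else fwd

    # Hypothetical toy lexicon of clinical terms; a real lexicon is far larger.
    lexicon = {"患者", "主诉", "头痛", "恶心", "呕吐"}
    print(bidirectional_mm("患者主诉头痛伴恶心呕吐", lexicon))
    # -> ['患者', '主诉', '头痛', '伴', '恶心', '呕吐']

One natural way to turn such output into partially labeled training data, again an assumption about the paper's pipeline, is to assign BMES character tags only at positions where the forward and backward segmentations agree and to leave disagreeing positions unlabeled, which is exactly the kind of partial annotation the objective above consumes.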
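
The modified Viterbi algorithm is also only named, not specified. One plausible reading, sketched below under that assumption, is a constrained Viterbi decode over BMES character tags in which lexicon matches restrict the tags admissible at each position, while unmatched positions allow all four tags:

    # Sketch under the assumptions described above; not the paper's implementation.
    import numpy as np

    TAGS = ["B", "M", "E", "S"]  # word-Begin, word-Middle, word-End, Single

    def constrained_viterbi(emissions, allowed):
        """Decode the best BMES tag sequence.

        emissions : (T, 4) array of per-character log-scores from the CRF.
        allowed   : list of T sets of tag indices permitted at each position,
                    derived from lexicon matches ({0, 1, 2, 3} if unconstrained).
        """
        # Transitions that are structurally valid in the BMES scheme.
        trans = np.full((4, 4), -np.inf)
        for a, succ in {"B": "ME", "M": "ME", "E": "BS", "S": "BS"}.items():
            for b in succ:
                trans[TAGS.index(a), TAGS.index(b)] = 0.0
        T = len(emissions)
        score = np.full((T, 4), -np.inf)
        back = np.zeros((T, 4), dtype=int)
        for k in allowed[0]:
            score[0, k] = emissions[0, k]
        for t in range(1, T):
            for k in allowed[t]:
                cand = score[t - 1] + trans[:, k]
                back[t, k] = int(np.argmax(cand))
                score[t, k] = cand[back[t, k]] + emissions[t, k]
        # Disallowed final tags hold -inf, so argmax picks a permitted one.
        path = [int(np.argmax(score[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(back[t, path[-1]])
        return [TAGS[k] for k in reversed(path)]

    # Toy example: a lexicon match forces characters 1-2 to form one word;
    # the sentence start is restricted to B or S, as BMES requires.
    em = np.random.default_rng(0).normal(size=(4, 4))
    allowed = [{0, 3}, {0}, {2}, {0, 1, 2, 3}]
    print(constrained_viterbi(em, allowed))  # e.g. ['S', 'B', 'E', 'S']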