Conference: International Conference on Computational Linguistics

Discriminative Boosting from Dictionary and Raw Text - A Novel Approach to Build A Chinese Word Segmenter

Abstract

Chinese word segmentation (CWS) is a basic and important task in Chinese information processing. Standard approaches to CWS treat it as a sequence labelling task, but without manually annotated corpora these approaches are ineffective. When a dictionary is available, dictionary maximum matching (DMM) is a good alternative. However, its performance is far from perfect because it recognizes out-of-vocabulary (OOV) words poorly. In this paper, we propose a novel approach that combines the advantages of discriminative training and DMM to build a high-quality word segmenter using only a dictionary and raw text. Experiments on CWS in different domains show that, compared with DMM, our approach brings significant improvements in both the news domain and the Chinese medicine patent domain, with error reductions of 21.50% and 13.66%, respectively. Furthermore, our approach increases the recall of OOV words by 42.54% and 23.72% in the two domains, respectively.
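To make the DMM baseline mentioned in the abstract concrete, here is a minimal sketch of one common variant, greedy forward maximum matching. The abstract does not say which matching direction the authors use, so the forward direction, the function name `forward_max_match`, and the toy dictionary are all assumptions made for illustration. The sketch also shows the OOV weakness the abstract describes: a word missing from the dictionary is broken into single characters.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching (illustrative sketch, not the paper's code).

    At each position, take the longest substring found in `dictionary`;
    if nothing matches, emit a single character and move on.
    """
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in dictionary), default=1)

    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            # No dictionary word starts here: fall back to one character.
            j = i + 1
        words.append(text[i:j])
        i = j
    return words


if __name__ == "__main__":
    # Hypothetical toy dictionary; "北京" is deliberately left out (OOV).
    toy_dictionary = {"我", "喜欢"}
    print(forward_max_match("我喜欢北京", toy_dictionary))
    # The OOV word "北京" comes out as two single characters.
```

Because the matcher can only emit dictionary entries or single characters, it never recovers an unseen multi-character word on its own, which is the gap the abstract says discriminative training is meant to close.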