Published in: International Conference on Computational Linguistics

Discriminative Boosting from Dictionary and Raw Text - A Novel Approach to Build A Chinese Word Segmenter


Abstract

Chinese word segmentation (CWS) is a basic and important task in Chinese information processing. Standard approaches to CWS treat it as a sequence labelling task, but they are ineffective without manually annotated corpora. When a dictionary is available, dictionary maximum matching (DMM) is a good alternative. However, its performance is far from perfect because it recognizes out-of-vocabulary (OOV) words poorly. In this paper, we propose a novel approach that combines the advantages of discriminative training and DMM to build a high-quality word segmenter from only a dictionary and raw text. Experiments on CWS in different domains show that, compared with DMM, our approach brings significant improvements in both the news domain and the Chinese medicine patent domain, with error reductions of 21.50% and 13.66%, respectively. Furthermore, our approach increases the recall of OOV words by 42.54% and 23.72% in the two domains, respectively.
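To make the two notions named in the abstract concrete, the following is a minimal sketch of dictionary maximum matching and of the character tags used when segmentation is cast as sequence labelling. It assumes the greedy forward variant of maximum matching and the common B/M/E/S tag set; the abstract specifies neither, and the function names `forward_max_match` and `to_bmes` and the toy dictionaries are hypothetical. It illustrates the DMM baseline only, not the proposed approach.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching (an assumed DMM variant).

    At each position, take the longest dictionary word that starts there;
    if none matches, emit a single character.
    """
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if j - i == 1 or text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
    return words


def to_bmes(words):
    """Convert a segmentation into per-character B/M/E/S tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


if __name__ == "__main__":
    toy_dict = {"中文", "分词", "中文分词", "是", "基本", "任务"}
    seg = forward_max_match("中文分词是基本任务", toy_dict)
    print(seg, "".join(to_bmes(seg)))

    # An OOV name falls apart into single characters, which is the
    # weakness of DMM that the abstract points out.
    print(forward_max_match("奥巴马访问北京", {"访问", "北京"}))
```

In the second call the name 奥巴马 is absent from the toy dictionary, so the matcher splits it into three one-character words, which shows why a purely dictionary-based segmenter has low OOV recall.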
