Conference: International Joint Conference on Natural Language Processing
Chinese New Word Finding Using Character-Based Parsing Model

Abstract

New word finding is a difficult but indispensable task in Chinese word segmentation. Traditional methods use string-level statistics to identify new words in a large-scale corpus. However, such statistics are neither convenient nor powerful enough to describe the laws governing a word's internal and external structure, and they are even less effective when new words occur with very low frequency in the corpus. In this paper, we present a novel method that uses parsing information to find new words. A character-level probabilistic context-free grammar (PCFG) model is trained on the People's Daily corpus and the Penn Chinese Treebank. Characters are fed into the character-level parsing system, and words are determined automatically from the parse tree. Our method describes word-building rules within full sentences and takes advantage of rich context to find new words. It is especially effective at identifying occasional or rarely used words, which usually have low frequency. Preliminary experiments indicate that our method can substantially improve the precision and recall of new word finding.
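The idea of reading words off a character-level parse tree can be sketched as follows. This is a minimal illustration, not the authors' system: the grammar, its probabilities, the label convention (`_w` for a complete word, `_b`/`_e` for word-internal characters), and the function names are all assumptions made up for this example, whereas the model in the paper is trained on treebank data. The sketch runs Viterbi CKY over the characters of a sentence and returns the yield of each word-level node in the best tree.

```python
import math
from collections import defaultdict

# Toy character-level PCFG in Chomsky normal form. Every rule and
# probability here is invented for illustration (hypothetical grammar).
# Labels ending in "_w" mark complete words; "_b" and "_e" mark the
# first and last character inside a two-character word.
LEXICAL = {  # (preterminal, character) -> probability
    ("PN_w", "我"): 1.0,
    ("VV_w", "爱"): 1.0,
    ("NR_b", "北"): 1.0,
    ("NR_e", "京"): 1.0,
    ("NN_w", "北"): 0.5,
    ("NN_w", "京"): 0.5,
}
BINARY = {  # (parent, left child, right child) -> probability
    ("S", "PN_w", "VP"): 1.0,
    ("VP", "VV_w", "NR_w"): 0.7,
    ("VP", "VV_w", "NP"): 0.3,
    ("NR_w", "NR_b", "NR_e"): 1.0,
    ("NP", "NN_w", "NN_w"): 1.0,
}


def build(best, i, j, label):
    """Rebuild the best tree for `label` over span [i, j) from backpointers."""
    _, back = best[i, j][label]
    if isinstance(back, str):  # lexical rule: the backpointer is the character
        return (label, back)
    k, left, right = back
    return (label, build(best, i, k, left), build(best, k, j, right))


def viterbi_parse(chars, root="S"):
    """Viterbi CKY: return the most probable tree rooted in `root`, or None."""
    n = len(chars)
    best = defaultdict(dict)  # (i, j) -> {label: (log probability, backpointer)}
    for i, ch in enumerate(chars):
        for (tag, c), p in LEXICAL.items():
            if c == ch:
                best[i, i + 1][tag] = (math.log(p), ch)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (parent, left, right), p in BINARY.items():
                    if left in best[i, k] and right in best[k, j]:
                        score = (math.log(p)
                                 + best[i, k][left][0]
                                 + best[k, j][right][0])
                        if (parent not in best[i, j]
                                or score > best[i, j][parent][0]):
                            best[i, j][parent] = (score, (k, left, right))
    if root not in best[0, n]:
        return None
    return build(best, 0, n, root)


def leaves(tree):
    """Characters covered by a tree node, left to right."""
    _, *children = tree
    if isinstance(children[0], str):
        return [children[0]]
    return [ch for child in children for ch in leaves(child)]


def words(tree):
    """Collect the yield of every topmost word-level ("_w") node."""
    label, *children = tree
    if label.endswith("_w"):
        return ["".join(leaves(tree))]
    if isinstance(children[0], str):  # stray character outside any word node
        return []
    return [w for child in children for w in words(child)]


def segment(sentence):
    """Segment a sentence via its best parse; None if the grammar has no parse."""
    tree = viterbi_parse(list(sentence))
    return words(tree) if tree else None


print(segment("我爱北京"))  # ['我', '爱', '北京']
```

In this toy grammar, 北京 comes out as a single word because the derivation through `NR_w` scores higher than the one that treats 北 and 京 as two separate nouns. The word boundary is decided by the sentence-level parse, not by how often the string occurs in a corpus.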
