Models and Algorithm of Chinese Word Segmentation

Abstract

Chinese word segmentation is of great significance in Chinese Natural Language Processing (NLP). This paper proposes a statistical segmentation model that integrates the character juncture model (CJM) with a word bi-gram language model, and then designs a strategy for estimating this model accurately and at low cost. The advantage of the proposed model is that it can simultaneously exploit the affinity of characters inside and outside a word together with word co-occurrence information to handle ambiguity. After investigating the difference between the real and theoretical sizes of the segmentation space, we apply the A* algorithm to perform segmentation without exhaustively searching all potential segmentations. Experiments show that the proposed methods are efficient: in our preliminary tests they achieve over 92% correct disambiguation and over 84% correct identification of unknown words.
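The abstract describes the approach only at a high level. The Python sketch below is a rough illustration of the search component alone: A*-style best-first search over candidate segmentations scored by a word bi-gram model with a simple admissible heuristic. The toy lexicon, counts, smoothing, and heuristic are hypothetical placeholders; the paper's character juncture model (CJM) and its estimation strategy are not reproduced here.

import heapq
import math
from itertools import count

# Toy lexicon with hypothetical unigram/bigram counts (placeholders, not data
# from the paper); the authors' CJM-based scores would take their place.
UNIGRAM = {"研究": 50, "研究生": 20, "生命": 40, "命": 10,
           "的": 200, "起源": 30, "生": 15}
BIGRAM = {("研究", "生命"): 8, ("生命", "的"): 12, ("的", "起源"): 9}
TOTAL = sum(UNIGRAM.values())
MAX_WORD_LEN = max(len(w) for w in UNIGRAM)

def word_cost(word, prev):
    # Negative log probability of `word` given the previous word, backing off
    # to a crudely smoothed unigram estimate when the bigram is unseen.
    if prev is not None and (prev, word) in BIGRAM:
        p = BIGRAM[(prev, word)] / UNIGRAM[prev]
    else:
        p = UNIGRAM.get(word, 0.5) / TOTAL
    return -math.log(p)

# Admissible per-character lower bound: no word reachable in the search costs
# less per character than this, so h(pos) = (n - pos) * MIN_CHAR_COST never
# overestimates the remaining cost.
MIN_CHAR_COST = min(
    min(word_cost(w, None) / len(w) for w in UNIGRAM),
    min(word_cost(w, p) / len(w) for (p, w) in BIGRAM),
)

def segment(sentence):
    # A* search; a state is (position, previous word), each edge appends one word.
    n = len(sentence)
    tie = count()  # tie-breaker so the heap never has to compare states
    frontier = [(n * MIN_CHAR_COST, 0.0, next(tie), 0, None, [])]
    expanded = {}
    while frontier:
        _, g, _, pos, prev, path = heapq.heappop(frontier)
        if pos == n:
            return path, g  # first complete segmentation popped is optimal
        if expanded.get((pos, prev), float("inf")) <= g:
            continue
        expanded[(pos, prev)] = g
        for end in range(pos + 1, min(pos + MAX_WORD_LEN, n) + 1):
            word = sentence[pos:end]
            if word in UNIGRAM or end == pos + 1:  # single-char fallback for unknowns
                g2 = g + word_cost(word, prev)
                h2 = (n - end) * MIN_CHAR_COST
                heapq.heappush(frontier, (g2 + h2, g2, next(tie), end, word, path + [word]))
    return None, float("inf")

print(segment("研究生命的起源"))  # -> (['研究', '生命', '的', '起源'], total cost)

Because the heuristic never overestimates the remaining cost, the first complete segmentation taken from the priority queue is optimal under these toy scores, which mirrors the abstract's point that A* search avoids exhaustively enumerating the segmentation space.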
