Source journal: 情報処理学会論文誌

A Japanese Word Segmenter Using a PPM-based Language Model



Abstract

Word segmentation, which divides an input sentence into words, is the most fundamental process in Japanese language processing. In this paper, we present a new method for segmenting an input sentence into words that is suitable for languages with no delimiter between words, such as Japanese and Chinese. First, we present a word segmentation model using a character-based n-gram model, which serves as our baseline. Next, we apply the PPM compression algorithm to the problem of word segmentation. PPM (Prediction by Partial Matching) is a lossless compression algorithm based on a finite-context probabilistic modeling technique, and PPM* is a variant of PPM in which there is no a priori bound on context length. In experiments on the ADD (ATR Dialogue Database) corpus, the proposed Japanese word segmenter using the PPM-based language model achieved higher accuracy than the character-based n-gram model. In particular, the proposed method using the PPM-based language model achieved 97.67% recall and 98.27% precision on open text.
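
The abstract does not give the model details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a fixed-order character model with PPM-style escape estimation (method C, without exclusions) rather than the unbounded-context PPM* variant, trains on text in which word boundaries are marked by spaces, and segments by choosing the boundary placement with the highest model probability. The names `PPMModel`, `segment`, the `alphabet_size` fallback, and the toy corpus are all hypothetical.

```python
import math
from collections import defaultdict


class PPMModel:
    """Fixed-order character model with PPM-style escapes (method C, no exclusions)."""

    def __init__(self, order=2, alphabet_size=65536):
        self.order = order
        # Assumed alphabet size for the uniform fallback below order 0.
        self.alphabet_size = alphabet_size
        # counts[context][symbol] -> frequency, for contexts of length 0..order
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        """Count every symbol under each of its preceding contexts."""
        for i, ch in enumerate(text):
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][ch] += 1

    def logprob(self, context, ch):
        """log P(ch | context), backing off from the longest context via escapes."""
        lp = 0.0
        context = context[-self.order:] if self.order else ""
        for k in range(len(context), -1, -1):
            seen = self.counts.get(context[len(context) - k:])
            if not seen:
                continue  # context never observed: shorten it at no cost
            total, distinct = sum(seen.values()), len(seen)
            if ch in seen:
                return lp + math.log(seen[ch] / (total + distinct))
            lp += math.log(distinct / (total + distinct))  # escape probability
        return lp + math.log(1.0 / self.alphabet_size)


def segment(model, text, sep=" "):
    """Insert `sep` between characters so that the model log-probability is maximal.

    Dynamic programming over the trailing context: one best hypothesis is kept
    per distinct context, which is exact for a fixed-order model.
    """
    best = {"": (0.0, "")}  # trailing context -> (score, output so far)
    for i, ch in enumerate(text):
        new = {}
        for ctx, (score, out) in best.items():
            # Either continue the current word or start a new one before `ch`.
            for piece in ([ch] if i == 0 else [ch, sep + ch]):
                s, c = score, ctx
                for sym in piece:
                    s += model.logprob(c, sym)
                    c = (c + sym)[-model.order:] if model.order else ""
                if c not in new or s > new[c][0]:
                    new[c] = (s, out + piece)
        best = new
    return max(best.values())[1].split(sep)


if __name__ == "__main__":
    # Toy training data (hypothetical): words separated by spaces.
    model = PPMModel(order=2)
    model.train("私 は 学生 です 。 私 は 先生 です 。")
    print(segment(model, "私は学生です。"))
```

A real system would train on a large segmented corpus such as the one named in the abstract; the toy corpus here only shows how the pieces fit together.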
