A Japanese Word Segmenter Using a PPM-based Language Model

HIROKI ODA; KENJI KITA


Abstract

Word segmentation, which divides an input sentence into words, is the most fundamental process in Japanese language processing. In this paper, we present a new word segmentation method suited to languages that have no delimiters between words, such as Japanese and Chinese. First, we present a word segmentation model using a character-based n-gram model, which serves as our baseline. Next, we apply the PPM compression algorithm to the problem of word segmentation. PPM (Prediction by Partial Matching) is a lossless compression algorithm based on finite-context probabilistic modeling, and PPM* is a variant of PPM in which there is no a priori bound on context length. In experiments on the ADD (ATR Dialogue Database) corpus, the proposed Japanese word segmenter using the PPM-based language model achieved higher accuracy than the character-based n-gram model. In particular, the proposed method achieved 97.67% recall and 98.27% precision on open text.
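The abstract does not spell out how the character-based baseline decides where words end, so the following is only a minimal sketch of one common formulation, not the authors' implementation. It assumes a character bigram model trained on pre-segmented text in which word boundaries are written as an explicit symbol, and it inserts a boundary wherever doing so makes the character sequence more probable. The names `BOUNDARY`, `train_bigram`, `log_prob`, and `segment`, the add-alpha smoothing, and the toy training data are all illustrative assumptions.

```python
import math
from collections import Counter

BOUNDARY = "|"  # hypothetical word-boundary symbol, assumed absent from the text


def train_bigram(segmented_sentences):
    """Count character bigrams over sentences given as lists of words."""
    bigrams, unigrams, vocab = Counter(), Counter(), {BOUNDARY}
    for words in segmented_sentences:
        # Wrap the sentence in boundaries and put one between every two words.
        symbols = [BOUNDARY] + list(BOUNDARY.join(words)) + [BOUNDARY]
        vocab.update(symbols)
        for prev, cur in zip(symbols, symbols[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    return bigrams, unigrams, vocab


def log_prob(model, prev, cur, alpha=0.1):
    """Add-alpha smoothed log P(cur | prev); alpha is an assumed setting."""
    bigrams, unigrams, vocab = model
    # One extra vocabulary slot is reserved for unseen characters.
    denom = unigrams[prev] + alpha * (len(vocab) + 1)
    return math.log((bigrams[(prev, cur)] + alpha) / denom)


def segment(model, text):
    """Split text at every gap where inserting BOUNDARY is more probable."""
    if not text:
        return []
    words, current = [], text[0]
    for prev, cur in zip(text, text[1:]):
        joined = log_prob(model, prev, cur)
        split = log_prob(model, prev, BOUNDARY) + log_prob(model, BOUNDARY, cur)
        if split > joined:
            words.append(current)
            current = cur
        else:
            current += cur
    words.append(current)
    return words


# Toy training data (invented for illustration only).
corpus = [
    ["私", "は", "学生", "です"],
    ["彼", "は", "先生", "です"],
    ["私", "は", "先生", "です"],
]
model = train_bigram(corpus)
print(segment(model, "彼は学生です"))  # ['彼', 'は', '学生', 'です']
```

With a bigram model each gap can be decided independently, which is why no search is needed here. With longer contexts, such as higher-order n-grams or PPM-style models, neighbouring decisions interact, and a dynamic-programming search over boundary placements would be needed instead.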


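Bibliographic details

Source
情報処理学会論文誌, 2000, No. 3, pp. 289-700 (412 pages in total)
Authors
HIROKI ODA; KENJI KITA
Indexed in: Science Citation Index (SCI), USA
Original format: PDF
Language of text: Japanese (jpn)
Chinese Library Classification: computing technology; computer technology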

