This paper describes a character-based n-gram language model. The proposed model is based on Kanji and Kana characters instead of words or morphemes determined by morphological analysis. To strengthen the model, character strings are used in addition to single characters as its basic units. We examined two methods for choosing the character strings: one is based on frequency in the training corpus, and the other is based on mutual information as well as frequency. We carried out experiments comparing the perplexity and character error rate (CER) of the proposed model with those of conventional word-based and character-based n-gram models. The results showed that the mutual-information-based method performed better than the frequency-based one. Although the proposed model was not superior to the word-based model, it was better than the character-based one. The vocabulary size of the proposed model was about 50 smaller than that of the word-based model.
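As an illustration of the second selection method, the following minimal Python sketch picks two-character strings that are both frequent and strongly associated. It is not the procedure used in the paper: the function name `select_character_strings`, the thresholds `min_freq` and `min_pmi`, the restriction to adjacent character pairs, and the use of pointwise mutual information as the association score are all assumptions made for this example.

```python
import math
from collections import Counter


def select_character_strings(corpus, min_freq=2, min_pmi=1.0):
    """Return adjacent character pairs that pass both thresholds.

    Hypothetical helper: the thresholds and the pointwise mutual
    information score are assumptions, not taken from the paper.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    selected = {}
    for pair, count in bigrams.items():
        # Frequency criterion: skip pairs that are too rare.
        if count < min_freq:
            continue
        # Association criterion: compare the pair probability with the
        # product of the probabilities of its two characters.
        p_pair = count / total_bi
        p_first = unigrams[pair[0]] / total_uni
        p_second = unigrams[pair[1]] / total_uni
        pmi = math.log2(p_pair / (p_first * p_second))
        if pmi >= min_pmi:
            selected[pair] = pmi
    return selected


corpus = ["東京に行く", "東京で働く", "京都に行く"]
units = select_character_strings(corpus)
```

Setting `min_pmi` to negative infinity reduces the sketch to a purely frequency-based selection, which corresponds loosely to the first method. The selected strings would then be added to the single characters as units of the n-gram model.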