An apparatus is disclosed for generating a statistical class sequence model, called a class bi-multigram model, from input training strings of discrete-valued units, where bigram dependencies are assumed between adjacent variable-length sequences of at most N units, and where class labels are assigned to the sequences. The apparatus counts the number of times each sequence of units occurs and the number of times each pair of sequences co-occurs in the input training strings, and computes an initial bigram probability distribution over all pairs of sequences as the number of times the two sequences co-occur divided by the number of times the first sequence occurs in the input training strings. The sequences are then classified into a pre-specified number of classes. Further, an estimate of the bigram probability distribution of the sequences is calculated by using an expectation-maximization (EM) algorithm to maximize the likelihood of the input training strings computed with the input probability distributions, and the above processes are performed iteratively to generate the statistical class sequence model.
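
The counting and initialization step described in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed apparatus: the function name `initial_bigram_probs`, the parameter `max_len` (standing in for N), and the reading of "co-occur" as "the second sequence immediately follows the first" are all assumptions made for illustration. The classification and EM re-estimation steps are not shown.

```python
from collections import Counter


def initial_bigram_probs(units, max_len):
    """Hypothetical helper: p(s2 | s1) = count(s1, s2) / count(s1).

    `units` is one training string given as a list of discrete units.
    `max_len` plays the role of N, the maximum sequence length.
    """
    seq_counts = Counter()   # occurrences of each sequence of 1..max_len units
    pair_counts = Counter()  # occurrences of each pair of adjacent sequences
    n = len(units)
    for i in range(n):
        for a in range(1, max_len + 1):
            if i + a > n:
                break
            first = tuple(units[i:i + a])
            seq_counts[first] += 1
            # Assumption: a pair co-occurs when the second sequence
            # starts exactly where the first one ends.
            for b in range(1, max_len + 1):
                if i + a + b > n:
                    break
                second = tuple(units[i + a:i + a + b])
                pair_counts[(first, second)] += 1
    # Co-occurrence count divided by the count of the first sequence.
    return {
        (s1, s2): c / seq_counts[s1]
        for (s1, s2), c in pair_counts.items()
    }


probs = initial_bigram_probs(list("abab"), max_len=2)
```

In this sketch the values for a given first sequence need not sum to one, because overlapping segmentations are all counted; the result is only a starting point of the kind that the EM re-estimation described in the abstract would then refine.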