In this thesis, the use of word posterior probabilities for large-vocabulary continuous speech recognition is investigated in a unified statistical framework. The word posterior probabilities are derived directly from the sentence posterior probabilities, which are an essential part of Bayes' decision rule. Different approaches to the computation of these probabilities using N-best lists and word graphs are discussed, both theoretically and experimentally.
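As a sketch of this relation (the notation is assumed here for illustration and is not quoted from the thesis: x_1^T denotes the acoustic observation sequence and [w; \tau, t] a word hypothesis w with start time \tau and end time t), the sentence posterior probabilities follow from Bayes' rule, and a word posterior probability is obtained by summing over all sentence hypotheses that contain the word hypothesis:

\[
p(W \mid x_1^T) \;=\; \frac{p(x_1^T \mid W)\, p(W)}{\sum_{W'} p(x_1^T \mid W')\, p(W')},
\qquad
p([w;\tau,t] \mid x_1^T) \;=\; \sum_{W \,\ni\, [w;\tau,t]} p(W \mid x_1^T).
\]

In practice, the sum over sentence hypotheses is approximated on an N-best list or computed over a word graph, e.g. with a forward-backward algorithm.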
The word posterior probabilities are used as confidence measures for various applications, and they are shown to be the best confidence measure among those studied in this work. The performance of the confidence measures is evaluated in a unified framework using two evaluation metrics and five highly different speech corpora. With the word posterior probabilities, the relative reduction of the confidence error rate ranges between 18.6% and 35.4%.

In order to demonstrate the usefulness of the suggested confidence measure, the word posterior probabilities are applied to restrict maximum likelihood linear regression (MLLR) adaptation to those acoustic segments with a high confidence, so that incorrectly recognised parts of the transcription are excluded from the adaptation algorithm. Using this method, the word error rate is reduced by 4.8% relative on a German spontaneous-speech test set.

In a very similar manner, the word posterior probabilities are used to train an American Broadcast News recogniser with automatically generated, i.e. recognised, transcriptions: only those parts of the acoustic training corpus are used for which the confidence of the transcription is sufficiently high. In order to bootstrap an initial low-cost speech recognition system which can be used to recognise large quantities of untranscribed speech data for training purposes, a small amount of speech is transcribed manually; this small speech database with its manual transcriptions is then used to train the low-cost recogniser, which in turn recognises the training corpus. The process of recognising the training corpus and of re-estimating the model parameters on the recognised transcriptions is applied iteratively. In comparison with a fully tuned speech recognition system trained on 72 hours of manually transcribed data, the word error rates on two American Broadcast News test sets rise by only 14.6% and 16.6%, respectively.

Finally, two new sentence hypothesis scoring approaches are presented, both based on word posterior probabilities. In the first approach, which still aims at minimising the expected number of sentence errors, the word posterior probabilities replace the acoustic and language model probabilities during the scoring algorithm. Using this method, the word error rates are reduced by between 1.5% and 5.1% relative on the five speech corpora used in this thesis.

In the second approach, the expected number of word errors is minimised explicitly instead of the expected number of sentence errors. To this end, a cost function is used which is based on the observation that the identity of two words can be compared not only on the basis of a Levenshtein alignment, but also on the basis of points in time. With this new cost function, an efficient decision rule is derived which can be evaluated very elegantly and which makes use of the word posterior probabilities. With this new decision rule, the word error rates on the different test corpora are reduced consistently, by 2.3% to 5.1% relative.
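To sketch the second approach in formulae (again with assumed notation; the exact cost function of the thesis is not reproduced here): the general Bayes decision rule with a cost function \mathcal{L} reads

\[
\hat{W} \;=\; \operatorname*{argmin}_{W} \sum_{W'} p(W' \mid x_1^T)\, \mathcal{L}(W, W').
\]

With the 0-1 sentence-error cost, this reduces to the classical rule \hat{W} = \operatorname*{argmax}_W p(W \mid x_1^T). If, instead, the cost counts a word hypothesis as correct whenever a word with the same identity overlaps it in time, the expected cost decomposes over word hypotheses, and the rule takes, in this sketch, the form

\[
\hat{W} \;\approx\; \operatorname*{argmax}_{W} \sum_{[w;\tau,t] \in W} p([w;\tau,t] \mid x_1^T),
\]

i.e. the sentence hypothesis whose word posterior probabilities accumulate the largest mass is selected.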