In this work, a number of novel techniques for the improved treatment of spontaneous speech variabilities in large-vocabulary automatic speech recognition are developed and evaluated on US English conversational speech and spontaneous medical dictations. Two main aspects of spontaneous speech modeling are addressed: the general handling of pronunciation variability, and the individual and parallel treatment of multiple speech variabilities in the acoustic and pronunciation model of a one-pass speech recognizer.

The problem of optimally incorporating multiple alternative pronunciations into the search framework is addressed in the first part of the thesis. This includes the question of how to efficiently combine the probabilistic contributions of alternative pronunciations in the course of a left-to-right search procedure. The well-known maximum approximation, usually applied in this context, is compared to a novel time-synchronous sum approximation technique which integrates alternative pronunciations in a weighted sum of acoustic probabilities. It is shown on a conversational speech task that this approach outperforms the maximum approximation by 2% relative and reduces the search costs by 7%.

Another important issue in the incorporation of alternative pronunciations into the search framework is the statistical weighting of the pronunciations. The commonly used pronunciation unigram prior probabilities are typically estimated from the relative frequencies of pronunciations in the training hypotheses. This standard maximum likelihood solution is compared to a novel discriminative training scheme which extends the Discriminative Model Combination technique proposed in [Beyerlein 01].
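The contrast between the maximum approximation and the sum over alternative pronunciations described above can be sketched as follows. This is a minimal illustration with hypothetical priors and acoustic likelihoods, not the thesis's actual scoring code: `p(x|w)` is either approximated by the single best pronunciation variant `v` or computed as a prior-weighted sum over all variants.

```python
def max_approx(pron_scores):
    """Maximum approximation: p(x|w) ~ max_v p(v|w) * p(x|v).
    Only the dominant pronunciation variant contributes."""
    return max(prior * acoustic for prior, acoustic in pron_scores)

def sum_approx(pron_scores):
    """Sum approximation: p(x|w) = sum_v p(v|w) * p(x|v).
    All alternative pronunciations contribute in a weighted sum."""
    return sum(prior * acoustic for prior, acoustic in pron_scores)

# Hypothetical word with two pronunciation variants,
# given as (unigram prior p(v|w), acoustic likelihood p(x|v)) pairs.
scores = [(0.7, 1e-4), (0.3, 8e-5)]

best_only = max_approx(scores)   # 0.7 * 1e-4            = 7.0e-5
weighted  = sum_approx(scores)   # 7.0e-5 + 0.3 * 8e-5   = 9.4e-5
```

The sum never discards the probability mass of the non-best variants, so `sum_approx(scores) >= max_approx(scores)` always holds; the thesis's time-synchronous variant additionally organizes this summation so it fits a one-pass, left-to-right search.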
The developed iterative reestimation procedure is shown to adjust the influence of a specific pronunciation prior probability in the discriminant function depending on (1) the word error rate, (2) the frequency of occurrence of this pronunciation in the correct hypothesis and its rivals, and (3) the underlying acoustic, pronunciation, and language models. An evaluation of this technique on a conversational speech task showed a 6.5% relative improvement on the training corpus and a 2% relative gain on an independent test set.

The second major part of this thesis addresses the development and evaluation of a novel training and search framework which enables a specific, parallel treatment of multiple speech variabilities in the acoustic and pronunciation model. This technique (1) classifies portions of speech (e.g. words) with respect to given variability classes (e.g. rate of speech), (2) builds class-specific acoustic and pronunciation models, and (3) properly combines these models later in the search procedure on a word-level basis. A theoretical framework for an efficient integration of the class-specific acoustic and pronunciation models into a one-pass search procedure is developed which incorporates contributions from class-specific alternatives in a weighted sum of acoustic probabilities. This multi-variability framework applies a very general model combination technique which may be used to combine arbitrary acoustic and pronunciation models on word level. In this work, it is used in particular for a parallel, explicit treatment of three important spontaneous speech variabilities: pronunciation variability, rate-of-speech variability, and filled-pause variability. The best multi-variability system combines six class-specific acoustic and pronunciation models on word level and achieves a word error rate reduction of 13% relative on a highly spontaneous medical dictation task and a gain of 9% relative on conversational speech.
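The word-level combination of class-specific models described above can be sketched as a weighted sum over variability classes. This is a schematic illustration under assumed class names and probability values, not the thesis's implementation: each variability class `c` (e.g. a rate-of-speech class) contributes its class-specific acoustic score, weighted by a class prior.

```python
def combined_word_score(class_priors, class_likelihoods):
    """Word-level model combination across variability classes:
    p(x|w) = sum_c p(c|w) * p(x|w,c),
    where each class c has its own acoustic/pronunciation model."""
    return sum(p * class_likelihoods[c] for c, p in class_priors.items())

# Hypothetical example: one word scored under two rate-of-speech classes.
priors = {"fast": 0.4, "slow": 0.6}           # class weights p(c|w)
likelihoods = {"fast": 2e-4, "slow": 5e-5}    # class-specific scores p(x|w,c)

score = combined_word_score(priors, likelihoods)  # 0.4*2e-4 + 0.6*5e-5 = 1.1e-4
```

Because the combination happens per word inside the weighted sum, the same scheme extends directly from two classes to the six class-specific acoustic and pronunciation models of the best multi-variability system, without changing the one-pass search itself.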