In this paper we present a bimodal speech recognition system in which the audio and visual modalities are modeled and integrated using coupled hidden Markov models (CHMMs). CHMMs are probabilistic inference graphs that have hidden Markov models as sub-graphs. Chains in the corresponding inference graph are coupled through matrices of conditional probabilities modeling temporal influences between their hidden state variables. The coupling probabilities are both cross-chain and cross-time; the latter is essential for allowing temporal influences between chains, which is important in modeling bimodal speech. Our bimodal speech recognition system employs a two-chain CHMM, with one chain associated with the acoustic observations and the other with the visual features. A deterministic approximation for maximum a posteriori (MAP) estimation is used to enable fast classification and parameter estimation. We evaluated the system on a speaker-independent connected-digit task. Compared with an acoustic-only ASR system trained using only the audio channel of the same database, the bimodal system consistently demonstrates improved noise robustness at all SNRs. We further compare the CHMM system reported in this paper with our earlier bimodal speech recognition system in which the two modalities are fused by concatenating the audio and visual features. The recognition results clearly show the advantages of the CHMM framework in the context of bimodal speech recognition.
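To illustrate the cross-chain, cross-time coupling described above, the sketch below builds a toy two-chain CHMM transition for the audio chain. The state sizes, matrix names (`A_aa`, `A_va`), and the factorization of the coupled transition as a normalized product of within-chain and cross-chain factors are all illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_rows(m):
    """Normalize each row so it is a valid conditional distribution."""
    return m / m.sum(axis=1, keepdims=True)

# Hypothetical two-chain CHMM: an audio chain (a) and a visual chain (v),
# each with n hidden states. Coupling matrices hold the conditional
# probabilities linking hidden states across time (and across chains).
n = 3
A_aa = norm_rows(rng.random((n, n)))  # audio_{t-1} -> audio_t (within-chain)
A_va = norm_rows(rng.random((n, n)))  # visual_{t-1} -> audio_t (cross-chain)

def audio_transition(s_a_prev, s_v_prev):
    """P(audio state at t | audio and visual states at t-1), taken here as
    the normalized product of the two factors (an assumed factorization)."""
    p = A_aa[s_a_prev] * A_va[s_v_prev]
    return p / p.sum()

dist = audio_transition(0, 1)
print(dist.sum())  # a valid distribution over the n audio states
```

Because the audio chain's next state conditions on the visual chain's previous state (and vice versa in the symmetric case), the model captures the temporal lead or lag between lip motion and acoustics that feature concatenation cannot.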