Hidden Markov models (HMMs) are known to model the duration of sound units poorly. To overcome this weakness, we present a technique that normalizes the duration of each phone, on the conjecture that speech with normalized phone durations may be better modeled and discriminated by standard HMM acoustic models. Duration normalization is accomplished by dropping frames if a phone is longer than the desired duration, and by adding "missing" frames and reconstructing them if a phone is shorter than the desired duration. If phone segmentations are known a priori, we achieve a 15.8% relative reduction in word error rate (WER) on spontaneous speech and a 10.3% relative reduction in WER on read speech. Preliminary work with automatic phone segmentations derived from the data is also presented.
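To make the normalization step concrete, the following is a minimal sketch of resampling one phone segment to a fixed number of frames. It is an illustration under stated assumptions, not the method evaluated here: the abstract does not say which frames are dropped or how missing frames are reconstructed, so the sketch assumes evenly spaced frame selection for long phones and linear interpolation between neighboring frames for short ones. The function name `normalize_duration` and the parameter `target_len` are hypothetical.

```python
from typing import List, Sequence

Frame = List[float]


def normalize_duration(frames: Sequence[Sequence[float]], target_len: int) -> List[Frame]:
    """Resample one phone segment to exactly target_len feature frames.

    Assumptions (not taken from the abstract): long phones are shortened by
    keeping evenly spaced frames, and short phones are lengthened by linear
    interpolation between neighboring frames.
    """
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one input frame and a positive target length")
    if n == target_len:
        return [list(f) for f in frames]
    if target_len == 1:
        # Degenerate case: keep the middle frame only.
        return [list(frames[n // 2])]

    out: List[Frame] = []
    if n > target_len:
        # Phone is too long: keep evenly spaced frames and drop the rest.
        for k in range(target_len):
            # Integer arithmetic for round-half-up of k * (n - 1) / (target_len - 1).
            idx = (2 * k * (n - 1) + (target_len - 1)) // (2 * (target_len - 1))
            out.append(list(frames[idx]))
        return out

    # Phone is too short: place output frames at fractional positions on the
    # input time axis and interpolate the "missing" frames linearly.
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1.0 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out
```

With known phone boundaries, each segment of the utterance would be passed through such a function and the results concatenated before decoding. The first and last frames of every phone are preserved in both branches, which keeps the segment boundaries aligned with the original segmentation.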