In speech recognition, there has been a trend toward incorporating more and more knowledge about human hearing into the feature extraction step. One such approach is localized spectro-temporal analysis, which is inspired by neurophysiological studies. Here we experiment with extracting features from patches of the widely used critical-band log-energy spectrum by applying the two-dimensional cosine transform. Compared with earlier similar studies that used the spectrogram representation, our method performs no worse and is faster. In experiments with noisy speech, the proposed representation proves more noise-robust than conventional mel-frequency cepstral features.
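
As a rough illustration of the idea of patch-based features, the following minimal sketch cuts a rectangular patch out of a band-by-frame log-energy matrix and keeps the low-order coefficients of an unnormalized two-dimensional DCT-II. The function names, the patch size, the number of retained coefficients, and the toy input are all assumptions made for illustration; they are not taken from the paper and may differ from the configuration it actually uses.

```python
import math


def extract_patch(spectrum, band_start, frame_start, n_bands, n_frames):
    """Cut an n_bands x n_frames patch out of a log-energy spectrum.

    `spectrum` is a list of rows, one row per critical band, each row
    holding one log-energy value per time frame (hypothetical layout).
    """
    return [row[frame_start:frame_start + n_frames]
            for row in spectrum[band_start:band_start + n_bands]]


def dct2_patch(patch, n_freq_coefs=3, n_time_coefs=3):
    """Return low-order coefficients of an unnormalized 2D DCT-II.

    The choice of a 3 x 3 block of coefficients is an assumption for
    this sketch, not a value reported in the source.
    """
    n_rows = len(patch)
    n_cols = len(patch[0])
    coefs = []
    for u in range(n_freq_coefs):          # index along the frequency axis
        for v in range(n_time_coefs):      # index along the time axis
            total = 0.0
            for i in range(n_rows):
                for j in range(n_cols):
                    total += (patch[i][j]
                              * math.cos(math.pi * (i + 0.5) * u / n_rows)
                              * math.cos(math.pi * (j + 0.5) * v / n_cols))
            coefs.append(total)
    return coefs


# Toy input: 6 bands x 8 frames of made-up log energies.
toy_spectrum = [[band + 0.1 * frame for frame in range(8)]
                for band in range(6)]
toy_patch = extract_patch(toy_spectrum, band_start=1, frame_start=2,
                          n_bands=4, n_frames=5)
toy_features = dct2_patch(toy_patch)
```

Keeping only the low-order coefficients gives a compact, smoothed description of each patch. In a full front end, patches taken at several frequency positions would presumably be transformed the same way and their coefficients concatenated into one feature vector per frame.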