This thesis addresses an important research gap concerning the effects of real-life conditions, including coded, narrow-band and noisy speech signals, on automatic emotion recognition (AER) from speech. In addition, the study investigates efficient methods of reducing the possible detrimental effects of speech compression on AER. The thesis consists of two parts. The first part investigates the effects of noise, data compression and bandwidth reduction on AER from speech signals. The second part investigates the application of AER based on speech spectrograms (SS) and of Artificial Bandwidth Extension (ABE) to improve the robustness and accuracy of emotion recognition from speech under these potentially undesirable conditions. The effects of the adaptive multi-rate (AMR), adaptive multi-rate wideband (AMR-WB), extended adaptive multi-rate wideband (AMR-WB+) and MP3 compression methods are compared against emotion recognition from uncompressed speech. Noisy conditions are simulated by adding Gaussian white noise to the speech signals at different signal-to-noise ratios (SNR). Bandwidth reduction is tested by filtering the speech signals. The AER methods include techniques based on acoustic speech parameters, namely mel-frequency cepstral coefficients (MFCCs), Teager energy operator and perceptual wavelet packet (TEO-PWP) features, and glottal time- and frequency-domain features (GP-T & GP-F), as well as spectrogram image (SS) parameters, spectrogram critical band scale (SS-CB) parameters and spectrogram Bark scale (SS-Bark) parameters. The modelling of acoustic classes is based on the Gaussian Mixture Model (GMM), and all experiments use the same Berlin Emotional Speech database. The ABE of narrow-band speech is performed using spectral folding and spectral envelope estimation methods. The major findings described in this thesis indicate that:
1. Standard speech compression methods such as AMR, AMR-WB, AMR-WB+ and MP3 have a significant effect on AER and, in general, lead to a significant degradation of AER accuracy.
2. The low-frequency components of speech (0 kHz to 1 kHz), which contain the fundamental frequency information, as well as the high-frequency components (above 4 kHz), have a key effect on AER accuracy.
3. A significant reduction of AER accuracy was observed for uncompressed speech modified to simulate a typical mild-to-moderate high-frequency hearing loss. This accuracy was reduced further when the modified speech was compressed.
4. Adding noise to either uncompressed or compressed speech reduces AER accuracy. Under noisy conditions, MFCCs were the best-performing features and AMR-WB was the best-performing speech compression algorithm.
5. The detrimental effects of speech compression can be mitigated by using AER based on speech spectrogram features.
6. Artificially extending the bandwidth of narrow-band AMR-compressed speech can improve AER accuracy.
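The noisy test condition, in which Gaussian white noise is added to speech at a chosen SNR, can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes that SNR is defined over the whole utterance as 10·log10(P_signal / P_noise), with power taken as the mean squared amplitude, and the function name `add_white_noise` is hypothetical. The SNR values used in the experiments are not reproduced here.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng=None):
    """Return signal plus white Gaussian noise at the requested SNR (in dB).

    Assumption: SNR = 10*log10(P_signal / P_noise), where power is the mean
    squared amplitude over the whole signal (a global, not segmental, SNR).
    The function name is hypothetical.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    p_signal = np.mean(signal ** 2)
    # Target noise power implied by the requested SNR.
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    noise = rng.standard_normal(signal.shape)
    # Rescale so the realised noise power matches the target exactly.
    noise *= np.sqrt(p_noise / np.mean(noise ** 2))
    return signal + noise
```

Calling `add_white_noise(x, snr_db)` once per SNR level produces one noisy copy of each utterance per condition. Rescaling the drawn noise to its exact target power, rather than relying on the expected variance, keeps the realised SNR equal to the nominal value even for short signals.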
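The TEO-PWP features build on the Teager energy operator. As a minimal sketch, assuming the standard discrete form Ψ[x(n)] = x²(n) − x(n−1)·x(n+1), the operator alone can be written as below. The perceptual wavelet packet decomposition and the feature computation used in the thesis are not shown, and `teager_energy` is a hypothetical name.

```python
import numpy as np


def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]**2 - x[n-1] * x[n+1].

    The output has len(x) - 2 samples, because the first and last samples
    lack a neighbour on one side.
    """
    x = np.asarray(x, dtype=np.float64)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```

For a pure tone A·cos(Ωn + φ), the operator returns the constant A²·sin²(Ω), so its output depends on both the amplitude and the frequency of the signal. This property is why the TEO is often described as a measure of the energy needed to generate a signal, rather than of its amplitude alone.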
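Spectral folding, one of the two ABE approaches mentioned above, can be illustrated with a short sketch. It assumes a narrow-band input (for example, speech sampled at 8 kHz) and doubles the sampling rate by zero insertion. Instead of removing the resulting spectral image above the old Nyquist frequency, it keeps an attenuated copy as a synthetic high band. The whole-signal FFT processing, the fixed `high_band_gain` and the function name `spectral_folding_abe` are illustrative assumptions, not the thesis implementation. A practical system would typically work frame by frame and might shape the high band differently. The spectral envelope estimation approach is not covered.

```python
import numpy as np


def spectral_folding_abe(x_nb, high_band_gain=0.1):
    """Extend a narrow-band signal by spectral folding (illustrative sketch).

    Returns a signal at twice the input sampling rate: the original band is
    ideally interpolated, and the band above the old Nyquist frequency holds
    a mirrored copy of the narrow-band spectrum scaled by high_band_gain.
    The gain value and whole-signal FFT processing are assumptions.
    """
    x_nb = np.asarray(x_nb, dtype=np.float64)
    n = x_nb.size
    # Zero insertion doubles the rate; its spectrum repeats the narrow-band
    # spectrum, so the part above the old Nyquist is a folded (mirrored) copy.
    upsampled = np.zeros(2 * n)
    upsampled[::2] = x_nb
    spec = np.fft.rfft(upsampled)            # bins 0..n; old Nyquist at bin n/2
    # Factor 2 restores the amplitude lost by inserting zeros.
    spec *= 2.0
    high = np.arange(spec.size) >= n / 2     # bins of the folded image
    spec[high] *= high_band_gain             # attenuate the synthetic high band
    return np.fft.irfft(spec, n=2 * n)
```

For a 1 kHz tone sampled at 8 kHz, the output at 16 kHz contains the original 1 kHz tone plus a 7 kHz mirror image whose amplitude is scaled by `high_band_gain`. With a gain of zero, the function reduces to plain band-limited interpolation, which makes the gain a simple knob for trading added high-frequency content against audible artefacts.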