The purpose of this paper is to describe the realization of speech emotion recognition. Speech emotion recognition has generally been performed in a text-independent mode, so previous research has overlooked the fact that emotion features vary with the text or phonemes, even though this can degrade classification performance. To overcome this distortion, a framework for speech emotion recognition based on the segmentation of voiced and unvoiced sound is proposed. Because vocalization differs substantially between voiced and unvoiced sound, their emotion features have different characteristics, and the two should therefore be considered separately. In this paper, voiced/unvoiced classification is performed using the spectral flatness measure and the spectral center, and a Gaussian mixture model with five mixtures is employed for emotion recognition. To validate the proposed framework, two systems are compared: the first classifies emotion from whole utterances (the ordinary method), and the second uses segments of voiced and unvoiced sound (the proposed method). The proposed approach yields higher classification rates than previous systems, both when each emotion feature (linear prediction coding (LPC), Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP), and energy) is used individually and when the four features are combined.
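As a rough illustration of the pipeline described above, the following Python sketch makes a frame-level voiced/unvoiced decision from spectral flatness and spectral centroid (the standard librosa feature, used here as a stand-in for the paper's spectral center) and fits one five-mixture GMM per emotion class. It assumes librosa and scikit-learn; the thresholds are hypothetical, since the paper does not state them, and the LPC/MFCC/PLP/energy feature extraction step is omitted.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

# Hypothetical thresholds -- the paper does not specify exact values.
FLATNESS_THRESH = 0.3   # voiced frames tend to have low spectral flatness
CENTROID_THRESH = 2000  # Hz; unvoiced (fricative-like) frames skew high

def split_voiced_unvoiced(y, sr, frame_length=1024, hop_length=512):
    """Label each frame voiced/unvoiced from spectral flatness and centroid."""
    flatness = librosa.feature.spectral_flatness(
        y=y, n_fft=frame_length, hop_length=hop_length)[0]
    centroid = librosa.feature.spectral_centroid(
        y=y, sr=sr, n_fft=frame_length, hop_length=hop_length)[0]
    # Treat a frame as voiced when its spectrum is peaky (low flatness)
    # and its energy is concentrated at low frequencies (low centroid).
    return (flatness < FLATNESS_THRESH) & (centroid < CENTROID_THRESH)

def train_emotion_gmms(features_by_emotion):
    """Fit one five-mixture GMM per emotion on frame-level feature matrices."""
    return {
        emotion: GaussianMixture(n_components=5, covariance_type="diag").fit(X)
        for emotion, X in features_by_emotion.items()
    }

def classify(gmms, X):
    """Return the emotion whose GMM gives the highest average log-likelihood."""
    return max(gmms, key=lambda emotion: gmms[emotion].score(X))
```

In the proposed framework, separate models would be trained on the voiced and unvoiced segments selected by the mask above, rather than on whole utterances as in the ordinary method.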