One of the main problems faced by automatic speech recognition is the variability of the testing conditions. This is due both to the acoustic conditions (different transmission channels, recording devices, noises, etc.) and to the variability of speech across different speakers (e.g. different accents, coarticulation of phonemes and different vocal tract characteristics). Vocal tract length normalisation (VTLN) aims to normalise the acoustic signal, making it independent of the vocal tract length. This is done by a speaker-specific warping of the frequency axis, parameterised by a warping factor. In this thesis the application of VTLN to multiparty conversational speech was investigated, focusing on the meeting domain. This is a challenging task, as the speech acoustics vary greatly both across speakers and over time for a given speaker. VTL, the distance between the lips and the glottis, varies over time. We observed that the warping factors estimated using maximum likelihood appear to be context dependent: they are influenced by the current conversational partner and correlated with the behaviour of the formant positions and the pitch. This is because VTL also influences the frequency of vibration of the vocal cords, and thus the pitch. In this thesis we also investigated pitch-adaptive acoustic features with the goal of further improving the speaker normalisation provided by VTLN.

We explored the use of acoustic features obtained using a pitch-adaptive analysis in combination with conventional features such as Mel frequency cepstral coefficients. These spectral representations were combined both at the acoustic feature level, using heteroscedastic linear discriminant analysis (HLDA), and at the system level, using ROVER. We evaluated this approach on a challenging large vocabulary speech recognition task: multiparty meeting transcription.
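The speaker-specific frequency warping at the heart of VTLN can be sketched as follows. This is a minimal illustration only, assuming a piecewise-linear warping function (one common parameterisation); the function name, sampling rate and knee position are chosen here for illustration and are not taken from the thesis.

```python
import numpy as np

def vtln_warp(freqs, alpha, f_max=8000.0, f_cut=0.85):
    """Piecewise-linear VTLN warping of the frequency axis.

    alpha is the speaker-specific warping factor (alpha > 1 compresses
    the spectrum of a speaker with a shorter vocal tract, alpha < 1
    stretches it).  Below a knee frequency the axis is scaled by alpha;
    above it, a second linear segment maps f_max onto itself so the
    warped axis stays within [0, f_max].
    """
    freqs = np.asarray(freqs, dtype=float)
    knee = f_cut * min(1.0, 1.0 / alpha) * f_max
    return np.where(
        freqs <= knee,
        alpha * freqs,
        alpha * knee + (f_max - alpha * knee) * (freqs - knee) / (f_max - knee),
    )
```

In practice such a warp is applied to the filterbank centre frequencies before cepstral analysis, and the warping factor is chosen per speaker, e.g. by a maximum-likelihood search over a grid of candidate values.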
We found that VTLN benefits the most from pitch-adaptive features. Our experiments also suggested that combining conventional and pitch-adaptive acoustic features using HLDA results in a consistent, significant decrease in the word error rate across all tasks. Combining at the system level using ROVER resulted in a further significant improvement. Further experiments compared the use of a pitch-adaptive spectral representation with the adoption of a smoothed spectrogram for the extraction of cepstral coefficients. We found that pitch-adaptive spectral analysis, by providing a representation that is less affected by pitch artefacts (especially for high-pitched speakers), delivers features with improved speaker independence. This was also shown to be advantageous when HLDA is applied. The combination of a pitch-adaptive spectral representation and VTLN-based speaker normalisation in the context of LVCSR for multiparty conversational speech led to more speaker-independent acoustic models, improving the overall recognition performance.
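The feature-level combination described above can be pictured as concatenating the two feature streams frame by frame and projecting the result to a lower dimension. As a simplified stand-in for HLDA (which generalises LDA by dropping the equal-class-covariance assumption), the sketch below uses plain Fisher LDA; all array names are hypothetical and not taken from the thesis.

```python
import numpy as np

def lda_project(X, y, n_components):
    """Fisher LDA projection of concatenated feature streams.

    Simplified stand-in for HLDA: both project (frames, dims) features
    X to n_components dimensions using per-frame class labels y
    (e.g. phone or state labels), maximising between-class relative
    to within-class scatter.
    """
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * diff @ diff.T
    # Directions solving the generalised eigenproblem Sb v = lambda Sw v
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    W = evecs[:, order[:n_components]].real
    return X @ W

# Hypothetical usage: combine an MFCC stream with a pitch-adaptive
# stream frame by frame, then project to a standard dimensionality:
#   combined  = np.hstack([mfcc, pitch_adaptive])
#   projected = lda_project(combined, state_labels, 39)
```

In a full system the projection would be estimated on the training data and applied identically at test time; system-level combination with ROVER instead votes over the word hypotheses produced by separately trained recognisers.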