
Speaker normalisation for large vocabulary multiparty conversational speech recognition


Abstract

One of the main problems faced by automatic speech recognition is the variability of the testing conditions. This is due both to the acoustic conditions (different transmission channels, recording devices, noises etc.) and to the variability of speech across different speakers (i.e. due to different accents, coarticulation of phonemes and different vocal tract characteristics). Vocal tract length normalisation (VTLN) aims at normalising the acoustic signal, making it independent of vocal tract length. This is done by a speaker-specific warping of the frequency axis, parameterised through a warping factor. In this thesis the application of VTLN to multiparty conversational speech was investigated, focusing on the meeting domain. This is a challenging task showing a great variability of the speech acoustics both across different speakers and across time for a given speaker. VTL, the distance between the lips and the glottis, varies over time. We observed that the warping factors estimated using Maximum Likelihood seem to be context dependent: they appear to be influenced by the current conversational partner and are correlated with the behaviour of formant positions and the pitch. This is because VTL also influences the frequency of vibration of the vocal cords and thus the pitch. In this thesis we also investigated pitch-adaptive acoustic features with the goal of further improving the speaker normalisation provided by VTLN.

We explored the use of acoustic features obtained using a pitch-adaptive analysis in combination with conventional features such as Mel frequency cepstral coefficients. These spectral representations were combined both at the acoustic feature level using heteroscedastic linear discriminant analysis (HLDA), and at the system level using ROVER. We evaluated this approach on a challenging large vocabulary speech recognition task: multiparty meeting transcription. We found that VTLN benefits the most from pitch-adaptive features. Our experiments also suggested that combining conventional and pitch-adaptive acoustic features using HLDA results in a consistent, significant decrease in the word error rate across all the tasks. Combining at the system level using ROVER resulted in a further significant improvement. Further experiments compared the use of a pitch-adaptive spectral representation with the adoption of a smoothed spectrogram for the extraction of cepstral coefficients. It was found that pitch-adaptive spectral analysis, providing a representation which is less affected by pitch artefacts (especially for high-pitched speakers), delivers features with improved speaker independence. Furthermore, this has also been shown to be advantageous when HLDA is applied. The combination of a pitch-adaptive spectral representation and VTLN-based speaker normalisation in the context of LVCSR for multiparty conversational speech led to more speaker-independent acoustic models, improving the overall recognition performance.
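The frequency warping at the core of VTLN is commonly realised as a piecewise-linear rescaling of the filterbank frequency axis, with the speaker-specific warping factor chosen by a grid search that maximises the likelihood of the warped features under a reference acoustic model. The sketch below (Python) illustrates this general scheme only; the 0.85 boundary, the 0.80-1.20 search grid and the log_likelihood scoring helper are illustrative assumptions, not the exact configuration described in the thesis.

import numpy as np

def piecewise_linear_warp(freq, alpha, f_nyquist, boundary=0.85):
    """Warp a frequency axis by factor alpha.

    Below boundary * f_nyquist the axis is scaled linearly by alpha;
    above it the mapping is adjusted so that the Nyquist frequency maps
    onto itself, keeping the warped axis inside the analysis bandwidth.
    """
    freq = np.asarray(freq, dtype=float)
    f_cut = boundary * f_nyquist
    low = alpha * freq
    slope = (f_nyquist - alpha * f_cut) / (f_nyquist - f_cut)
    high = alpha * f_cut + slope * (freq - f_cut)
    return np.where(freq <= f_cut, low, high)

def estimate_warping_factor(frames, transcription, log_likelihood,
                            alphas=np.arange(0.80, 1.21, 0.02)):
    """Grid search for the Maximum Likelihood warping factor.

    log_likelihood(frames, transcription, alpha) is a hypothetical helper
    that extracts features with the warped filterbank and scores them
    against a speaker-independent acoustic model given the transcription.
    """
    scores = [log_likelihood(frames, transcription, a) for a in alphas]
    return alphas[int(np.argmax(scores))]

In a typical VTLN setup the warp would be applied to the mel filterbank centre frequencies before cepstral analysis, and the factor re-estimated per speaker during both training and decoding.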

Bibliographic details

  • Author

    Garau Giulia

  • Author affiliation
  • Year 2009
  • Total pages
  • Original format PDF
  • Language English
  • Classification
