Computer Speech and Language

Vocal Tract Length Normalization using a Gaussian mixture model framework for query-by-example spoken term detection

Abstract

The speech spectrum is known to change with variations in the length of a speaker's vocal tract, because speech formants are inversely related to the vocal tract length (VTL). The process of compensating for spectral variation due to vocal tract length is known as Vocal Tract Length Normalization (VTLN), an important speaker normalization technique for speech recognition and related tasks. In this paper, we use the Gaussian Posteriorgram (GP) of VTL-warped spectral features for a Query-by-Example Spoken Term Detection (QbE-STD) task, and present a Gaussian Mixture Model (GMM) framework for VTLN warping factor estimation. In particular, the presented GMM framework does not require phoneme-level transcription. We observe a correlation between the VTLN warping factor estimates obtained via a supervised HMM-based approach and an unsupervised GMM-based approach. In addition, phoneme recognition and speaker de-identification tasks are conducted using the GMM-based VTLN warping factor estimates. For QbE-STD, we consider three spectral features, namely, Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and MFCC-TMP (which uses the Teager Energy Operator (TEO) to implicitly exploit magnitude and phase information within the MFCC framework). Linear frequency scaling variations of the VTLN warping factor are incorporated into these three cepstral representations for the QbE-STD task. The VTL-warped Gaussian posteriorgram improves the Maximum Term Weighted Value by 0.021 (i.e., 2.1%) and 0.015 (i.e., 1.5%) for the MFCC and PLP feature sets, respectively, on the evaluation set of the MediaEval SWS 2013 corpus. This improvement is primarily due to VTLN warping factor estimation with the unsupervised GMM framework. Finally, the effectiveness of the proposed VTL-warped GP is demonstrated for rescoring using various detection sources, such as the depth of the detection valley, the Self-Similarity Matrix, Pseudo Relevance Feedback, and weighted mean features. (C) 2019 Elsevier Ltd. All rights reserved.
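
The transcription-free warping factor estimation described in the abstract can be illustrated with a short sketch: a background GMM is trained on unwarped cepstral features, and for each speaker a warping factor alpha is chosen from a grid so that the VTL-warped features maximize the GMM likelihood. The snippet below is a minimal illustration of this idea under stated assumptions, not the paper's exact front end: warped_mfcc is a simplified stand-in that applies a linear frequency scaling to the power spectrum before the mel filterbank, the alpha grid of 0.88-1.12 is a typical choice rather than the paper's, and names such as train_utts, speaker_utts, and sr are assumed.

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    ALPHAS = np.arange(0.88, 1.125, 0.02)  # assumed grid of VTLN warping factors

    def warped_mfcc(y, sr, alpha, n_mfcc=13, n_fft=512, hop=160):
        """MFCCs after a simple linear scaling of the frequency axis by alpha.
        A simplified stand-in for the VTL-warped front end (illustration only)."""
        S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
        freqs = np.linspace(0.0, sr / 2.0, S.shape[0])
        # place each frame's spectrum on the warped axis alpha * f, resample back onto f
        S_warp = np.stack([np.interp(freqs, alpha * freqs, frame) for frame in S.T], axis=1)
        mel = librosa.feature.melspectrogram(S=S_warp, sr=sr)
        return librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=n_mfcc).T  # (frames, n_mfcc)

    def estimate_warp_factor(utterances, sr, gmm):
        """Pick the warping factor whose features score highest under the background GMM.
        No phoneme-level transcription is involved, as in the abstract's GMM framework."""
        scores = []
        for alpha in ALPHAS:
            feats = np.vstack([warped_mfcc(y, sr, alpha) for y in utterances])
            scores.append(gmm.score(feats))  # mean per-frame log-likelihood
        return float(ALPHAS[int(np.argmax(scores))])

    # Illustrative usage (train_utts, speaker_utts, and sr are assumed names):
    # train_feats = np.vstack([warped_mfcc(y, sr, 1.0) for y in train_utts])
    # gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(train_feats)
    # alpha_spk = estimate_warp_factor(speaker_utts, sr, gmm)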
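
For the detection stage, the VTL-warped Gaussian posteriorgram can be paired with a subsequence DTW search of the kind commonly used with posteriorgram features in QbE-STD; the abstract does not spell out the exact search, so the sketch below is an assumption about one plausible form of the pipeline. Frame-level posteriors over the GMM components serve as the posteriorgram, and a length-normalized DTW cost over a negative-log inner-product distance scores how well the query matches a region of the test utterance. The additional detection sources mentioned in the abstract (detection-valley depth, Self-Similarity Matrix, Pseudo Relevance Feedback, weighted mean features) are not reproduced here.

    import numpy as np

    def gaussian_posteriorgram(feats, gmm):
        """Frame-level posteriors over the components of a fitted GMM (frames x components)."""
        return gmm.predict_proba(feats)  # gmm: e.g. the sklearn GaussianMixture trained above

    def dtw_detection_cost(query_gp, test_gp):
        """Length-normalized subsequence-DTW cost between two posteriorgrams.
        A lower cost suggests a more likely occurrence of the query in the test utterance."""
        eps = 1e-8
        dist = -np.log(np.maximum(query_gp @ test_gp.T, eps))  # (Q, T) local distances
        Q, T = dist.shape
        acc = np.full((Q, T), np.inf)
        acc[0, :] = dist[0, :]  # the query may start anywhere in the test utterance
        for i in range(1, Q):
            for j in range(1, T):
                acc[i, j] = dist[i, j] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
        return acc[-1].min() / Q  # the match may also end anywhere; normalize by query length

    # Illustrative usage, reusing the sketch above (query_wav, test_wav, sr, gmm,
    # and the per-utterance warp factors alpha_q / alpha_t are assumed names):
    # q_gp = gaussian_posteriorgram(warped_mfcc(query_wav, sr, alpha_q), gmm)
    # t_gp = gaussian_posteriorgram(warped_mfcc(test_wav, sr, alpha_t), gmm)
    # score = -dtw_detection_cost(q_gp, t_gp)  # higher score = stronger detection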
