Computer Speech and Language

Vocal Tract Length Normalization using a Gaussian mixture model framework for query-by-example spoken term detection



Abstract

A speech spectrum is known to change with variations in the length of a speaker's vocal tract, because speech formant frequencies are inversely related to the vocal tract length (VTL). The process of compensating for spectral variation due to vocal tract length is known as Vocal Tract Length Normalization (VTLN), a very important speaker normalization technique for speech recognition and related tasks. In this paper, we use the Gaussian Posteriorgram (GP) of VTL-warped spectral features for a Query-by-Example Spoken Term Detection (QbE-STD) task. The paper presents a Gaussian Mixture Model (GMM) framework for VTLN warping factor estimation; in particular, this GMM framework does not require phoneme-level transcription. We observed a correlation between the VTLN warping factor estimates obtained via a supervised HMM-based approach and those obtained via the unsupervised GMM-based approach. In addition, phoneme recognition and speaker de-identification tasks were conducted using GMM-based VTLN warping factor estimates. For QbE-STD, we considered three spectral features, namely, Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and MFCC-TMP (which uses the Teager Energy Operator (TEO) to exploit magnitude and phase information implicitly within the MFCC framework). Linear frequency scaling variations for the VTLN warping factor are incorporated into these three cepstral representations for the QbE-STD task. The VTL-warped Gaussian posteriorgram improved the Maximum Term Weighted Value by 0.021 (i.e., 2.1%) and 0.015 (i.e., 1.5%) for the MFCC and PLP feature sets, respectively, on the evaluation set of the MediaEval SWS 2013 corpus. This better performance is primarily due to VTLN warping factor estimation using the unsupervised GMM framework.
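The unsupervised estimation the abstract describes can be sketched as a maximum-likelihood grid search: warp an utterance's features for each candidate factor and keep the factor whose warped features score highest under a GMM trained on pooled, unwarped data. The sketch below is illustrative only; the feature warping step is a placeholder (a real system rescales the filterbank centre frequencies by the warp factor before cepstral analysis), and the grid range and GMM size are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def warped_features(utterance, alpha):
    """Hypothetical stand-in for extracting cepstral features with VTLN
    warp factor alpha; here we simply scale the features for illustration."""
    return utterance * alpha

# Pooled multi-speaker features; no phoneme-level transcription is needed.
train_feats = rng.normal(size=(2000, 13))

# Diagonal-covariance GMM fitted on unwarped training features.
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(train_feats)

def estimate_warp_factor(utterance, gmm,
                         alphas=np.arange(0.88, 1.13, 0.02)):
    """Grid search: return the warp factor whose warped features have the
    highest average log-likelihood under the GMM."""
    scores = [gmm.score(warped_features(utterance, a)) for a in alphas]
    return float(alphas[int(np.argmax(scores))])

utt = rng.normal(size=(300, 13))
alpha_hat = estimate_warp_factor(utt, gmm)
print(alpha_hat)
```

Because the selection criterion is only the GMM likelihood, the same procedure applies to unlabeled audio, which is the property the paper exploits.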
Finally, the effectiveness of the proposed VTL-warped GP is demonstrated for rescoring using various detection sources, such as the depth of the detection valley, the Self-Similarity Matrix, Pseudo Relevance Feedback, and weighted mean features. (C) 2019 Elsevier Ltd. All rights reserved.
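A Gaussian posteriorgram, as used throughout the abstract, maps each feature frame to its vector of posterior probabilities over the GMM components. A minimal sketch with scikit-learn (synthetic features; the component count is an assumption):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
feats = rng.normal(size=(500, 13))
gmm = GaussianMixture(n_components=16, covariance_type="diag",
                      random_state=0).fit(feats)

def gaussian_posteriorgram(frames, gmm):
    """One row per frame: posterior probability of each GMM component
    given the frame; each row sums to 1."""
    return gmm.predict_proba(frames)

gp = gaussian_posteriorgram(rng.normal(size=(100, 13)), gmm)
print(gp.shape)  # (100, 16)
```

With VTL-warped features as input, the same mapping yields the VTL-warped GP the paper evaluates.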
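In QbE-STD, a spoken query is matched against search audio by aligning their posteriorgram sequences, typically with dynamic time warping (DTW). The sketch below uses the negative log inner product as the local distance, a common choice in posteriorgram-based QbE-STD; it is not necessarily the exact distance or normalization used in this paper.

```python
import numpy as np

def dtw_cost(query_gp, search_gp, eps=1e-10):
    """Length-normalized cumulative DTW cost between two posteriorgram
    sequences (rows are per-frame posterior vectors)."""
    # Local distance: -log of the frame-wise inner product.
    dist = -np.log(query_gp @ search_gp.T + eps)  # shape (Tq, Ts)
    Tq, Ts = dist.shape
    acc = np.full((Tq + 1, Ts + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Tq + 1):
        for j in range(1, Ts + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    return acc[Tq, Ts] / (Tq + Ts)

# Toy example: uniform posteriors over 4 components.
q = np.full((5, 4), 0.25)
s = np.full((8, 4), 0.25)
c = dtw_cost(q, s)
print(round(c, 3))
```

Lower costs indicate likelier query occurrences; in practice the query is slid over the search audio (subsequence DTW) and detections are rescored with cues such as those listed in the abstract.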
