Frequency warping by linear transformation, and vocal tract inversion for speaker normalization in automatic speech recognition.


Abstract

Vocal Tract Length Normalization (VTLN) for standard filterbank-based Mel Frequency Cepstral Coefficient (MFCC) features is usually implemented by warping the center frequencies of the Mel filterbank, and the warping factor is estimated using the maximum likelihood score (MLS) criterion. A linear transform (LT) equivalent for frequency warping (FW) would enable more efficient MLS estimation. In this dissertation, we present a novel LT to perform FW for VTLN and model adaptation with standard MFCC features. Our formula for the transformation matrix is computationally simpler than previous LT approaches, with no required modification of the standard MFCC feature extraction scheme. In VTLN and Speaker Adaptive Modeling (SAM) experiments with the Resource Management (RM1) database, the performance of the new LT was comparable to that of regular VTLN by warping the Mel filterbank. This demonstrates that the approximations involved in the LT do not lead to any performance degradation. We also performed Speaker Adaptive Training (SAT) with a feature-space LT denoted CLTFW. Global CLTFW SAT gave results comparable to SAM and VTLN. By estimating multiple CLTFW transforms using a regression tree, and including an additive bias, we obtained significantly improved results compared to VTLN, with increasing adaptation data.

In the second part of the dissertation, vocal tract (VT) inversion to recover the VT shape sequence from speech signals is performed for vowels by cepstral analysis-by-synthesis, using chain-matrix calculation of VT acoustics and the Maeda articulatory model. The derivative of the VT chain matrix with respect to the area function was calculated in a novel, efficient manner, and used in the BFGS quasi-Newton method for optimizing a cost function that includes a distance measure between input and synthesized cepstral sequences, as well as regularization and continuity terms. Inversion was evaluated on data from the University of Wisconsin X-ray microbeam (XRMB) database. Good agreement was achieved between inverted midsagittal VT outlines and measured XRMB tongue and lip pellet positions, with smooth optimized articulatory trajectories and an average relative error of less than 3% in the first three formants.
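The conventional VTLN baseline that the abstract compares against (warping the Mel filterbank center frequencies and picking the warping factor by maximum likelihood) can be sketched roughly as follows. This is a minimal illustration, not the dissertation's method: the piecewise-linear warp shape, the `knee` value of 0.875, the particular Mel formula, and all function names are assumptions made here, and `log_likelihood` is a hypothetical stand-in for the recognizer score of the warped features. The linear-transform formulation proposed in the dissertation is not reproduced.

```python
import math

def hz_to_mel(f_hz):
    # One common Mel formula (assumed here; other variants exist).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_low, f_high):
    """Center frequencies in Hz, equally spaced on the Mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + step * (i + 1)) for i in range(n_filters)]

def piecewise_linear_warp(f_hz, alpha, f_max, knee=0.875):
    """Scale f by alpha up to a breakpoint, then join linearly to (f_max, f_max).

    The knee fraction is an illustrative assumption, not a value from the source.
    """
    f0 = knee * f_max * min(1.0, 1.0 / alpha)  # keeps alpha * f0 below f_max
    if f_hz <= f0:
        return alpha * f_hz
    slope = (f_max - alpha * f0) / (f_max - f0)
    return alpha * f0 + slope * (f_hz - f0)

def warped_centers(n_filters, f_low, f_high, alpha):
    """Warped Mel filterbank center frequencies for one warping factor."""
    centers = mel_center_frequencies(n_filters, f_low, f_high)
    return [piecewise_linear_warp(c, alpha, f_high) for c in centers]

def estimate_warp_factor(log_likelihood, candidates):
    """Grid search: return the candidate alpha with the highest score.

    `log_likelihood` is a hypothetical callable alpha -> score; in a real
    system it would re-extract features with the warped filterbank and score
    them against the acoustic model.
    """
    return max(candidates, key=log_likelihood)
```

In this sketch every candidate factor requires recomputing the features from the warped filterbank before scoring. That per-candidate cost is what a linear-transform equivalent of frequency warping is meant to reduce, since the warp can then be applied directly to the unwarped cepstra.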

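Bibliographic record

  • Author

    Panchapagesan, Sankaran

  • Author affiliation

    University of California, Los Angeles

  • Degree-granting institution: University of California, Los Angeles
  • Subject: Engineering, Electronics and Electrical
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 108
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Radio electronics, telecommunications technology
  • Keywords: none listed
  • Date added to database: 2022-08-17 11:38:42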
