Investigations on linear transformations for speaker adaptation and normalization

Abstract

This thesis deals with linear transformations at various stages of the automatic speech recognition process. In current state-of-the-art speech recognition systems, linear transformations are widely used to compensate for a potential mismatch between the training and testing data and thus to enhance recognition performance. A large number of approaches have been proposed in the literature, but the connections between them have so far been disregarded. By developing a unified mathematical framework, close relationships between the individual approaches are identified and analyzed in detail. Mel-frequency cepstral coefficients (MFCCs) are commonly used features in automatic speech recognition systems. The traditional way of computing MFCCs suffers from a twofold smoothing, which complicates both the MFCC computation and the system optimization. An improved approach is developed that does not use any filter bank and thus avoids the twofold smoothing. This integrated approach allows a very compact implementation and requires fewer parameters to be optimized. Starting from this new computation scheme for MFCCs, it is proven analytically that vocal tract normalization (VTN) is equivalent to a linear transformation in the cepstral space for arbitrary invertible warping functions. As examples, the transformation matrix for VTN is calculated explicitly for three commonly used warping functions. Based on some general characteristics of typical VTN warping functions, a common structure of the transformation matrix is derived that is almost independent of the specific functional form of the warping function. Expressing VTN as a linear transformation makes it possible, for the first time, to take the Jacobian determinant of the transformation into account for any warping function. The effect of considering the Jacobian determinant on the warping factor estimation is studied in detail.
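
The claim that frequency warping acts linearly on cepstral coefficients can be checked numerically. The following is a minimal sketch, not the thesis's derivation. It assumes the log spectrum is the cosine series log X(ω) = c_0 + 2 Σ_k c_k cos(kω), that the warped spectrum is log X(g(ω)), and that the matrix is truncated to the same number of coefficients on both sides. The function names, the piecewise linear warp, and its breakpoint `omega0` are illustrative choices, and the thesis may define the warping direction differently.

```python
import numpy as np


def piecewise_linear_warp(omega, alpha, omega0=0.875 * np.pi):
    """Map [0, pi] onto itself: slope alpha below omega0, then a line to (pi, pi)."""
    omega = np.asarray(omega, dtype=float)
    upper_slope = (np.pi - alpha * omega0) / (np.pi - omega0)
    return np.where(omega <= omega0,
                    alpha * omega,
                    alpha * omega0 + upper_slope * (omega - omega0))


def _grid(n_grid):
    # Midpoint-rule sample points on [0, pi].
    return (np.arange(n_grid) + 0.5) * np.pi / n_grid


def warp_matrix(warp, n_ceps, n_grid=4096):
    """A[n, k] = (s_k / pi) * integral of cos(n w) cos(k g(w)) dw, with s_0 = 1, s_k = 2."""
    omega = _grid(n_grid)
    orders = np.arange(n_ceps)
    scale = np.where(orders == 0, 1.0, 2.0)
    analysis = np.cos(np.outer(orders, omega))         # cos(n * w)
    synthesis = np.cos(np.outer(orders, warp(omega)))  # cos(k * g(w))
    return (analysis @ synthesis.T) * scale[None, :] / n_grid


def warped_cepstra_direct(ceps, warp, n_grid=4096):
    """Warp the log spectrum itself, then project back onto the cosine basis."""
    omega = _grid(n_grid)
    orders = np.arange(len(ceps))
    scale = np.where(orders == 0, 1.0, 2.0)
    log_spec = (scale * ceps) @ np.cos(np.outer(orders, warp(omega)))
    return np.cos(np.outer(orders, omega)) @ log_spec / n_grid


rng = np.random.default_rng(0)
ceps = rng.normal(size=12)                      # arbitrary test cepstrum
warp = lambda w: piecewise_linear_warp(w, alpha=1.1)
A = warp_matrix(warp, n_ceps=12)
print(np.allclose(A @ ceps, warped_cepstra_direct(ceps, warp)))
```

Under these assumptions, the identity warp yields the identity matrix, and `np.linalg.slogdet(A)` gives the log-determinant that would enter a likelihood as the Jacobian term.
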
The second part of this thesis deals with a specific linear transformation for speaker adaptation, the Maximum Likelihood Linear Regression (MLLR) approach. Based on the close relationship between MLLR and VTN proven in the first part, the general structure of the VTN matrix is adopted to restrict the MLLR matrix to a band structure, which significantly improves MLLR adaptation when only limited adaptation data are available. Finally, several enhancements to MLLR speaker adaptation are discussed. One concerns refined definitions of regression classes, which are of special importance for fast adaptation when only limited adaptation data are available. Another makes use of confidence measures to account for recognition errors in the first pass of a two-pass adaptation process, since such errors degrade adaptation performance.
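
A band restriction on the MLLR matrix can be sketched with the standard row-wise mean-transform estimate for diagonal-covariance Gaussians, solving each row only over its in-band columns plus the bias. This is a hedged illustration rather than the thesis's implementation. The function name, the argument layout, and the use of precomputed per-Gaussian statistics (`occupancy` as summed posteriors, `obs_sum` as posterior-weighted observation sums) are assumptions made for the example.

```python
import numpy as np


def estimate_banded_mllr(means, variances, occupancy, obs_sum, bandwidth):
    """Estimate mu_adapted = A @ mu + b with A[i, j] = 0 for |i - j| > bandwidth.

    means, variances, obs_sum: arrays of shape (n_gauss, dim)
    occupancy: array of shape (n_gauss,)
    """
    n_gauss, dim = means.shape
    ext = np.hstack([np.ones((n_gauss, 1)), means])  # extended means [1, mu]
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for i in range(dim):
        lo, hi = max(0, i - bandwidth), min(dim, i + bandwidth + 1)
        # Free parameters of row i: the bias plus the in-band columns.
        cols = np.concatenate(([0], np.arange(lo, hi) + 1))
        xi = ext[:, cols]
        weights = occupancy / variances[:, i]
        G = (xi * weights[:, None]).T @ xi            # accumulated outer products
        k = xi.T @ (obs_sum[:, i] / variances[:, i])  # accumulated cross terms
        w = np.linalg.solve(G, k)
        b[i] = w[0]
        A[i, lo:hi] = w[1:]
    return A, b
```

With `bandwidth = dim - 1` the estimate reduces to an unrestricted full matrix, and with `bandwidth = 0` to a diagonal one. A narrower band leaves fewer free parameters per row, which is the reason a band structure would be expected to help when adaptation data are scarce.
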

Bibliographic details

  • Author

    Pitz Michael;

  • Year 2005
  • Original format PDF
  • Language eng
