IEEE Transactions on Audio, Speech and Language Processing

A bidirectional target-filtering model of speech coarticulation and reduction: two-stage implementation for phonetic recognition



Abstract

A structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation. At the first stage, the dynamics of formants, or vocal tract resonances (VTRs), in fluent speech are generated from prior information about the resonance targets in the phone sequence, in the absence of acoustic data. Bidirectional temporal filtering with a finite impulse response (FIR) is applied to the segmental target sequence, which serves as the FIR filter's input; forward filtering produces anticipatory coarticulation and backward filtering produces regressive coarticulation. The filtering process is also shown to produce realistic resonance-frequency undershooting, or reduction, for fast-rate and low-effort speech in a contextually assimilated manner. At the second stage, the dynamics of the speech cepstra are predicted analytically from the FIR-filtered and speaker-adapted VTR targets, and the prediction residuals are modeled by Gaussian random variables with trainable parameters. The combined two-stage system thus generates correlated and causally related VTR and cepstral dynamics, in which phonetic reduction is represented explicitly in the hidden resonance space and implicitly in the observed cepstral space. We present details of model simulations demonstrating the quantitative effects of speaking rate and segment duration on the magnitude of reduction, agreeing closely with experimental measurements in the acoustic-phonetic literature. The two-stage model is implemented and applied to the TIMIT phonetic recognition task. Using the N-best (N = 2000) rescoring paradigm, the new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.
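The first-stage idea — a non-causal, bidirectional FIR filter smoothing a step-wise VTR target sequence, with short segments producing target undershoot (reduction) — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the symmetric exponential kernel shape, the stiffness parameter `gamma`, the filter half-length `D`, and the F1 target values are all assumptions made for demonstration.

```python
# Sketch of bidirectional (non-causal) FIR target filtering.
# Kernel, parameter values, and targets are illustrative assumptions.

def fir_kernel(gamma, D):
    """Two-sided exponential kernel h(k) = c * gamma^|k| for |k| <= D,
    normalized to unit sum so a long steady target is reached exactly.
    The left half models regressive (carry-over) coarticulation, the
    right half anticipatory coarticulation."""
    taps = [gamma ** abs(k) for k in range(-D, D + 1)]
    c = 1.0 / sum(taps)
    return [c * t for t in taps]

def filter_targets(targets, gamma=0.6, D=7):
    """Apply the non-causal FIR filter to a per-frame target sequence.
    Edge frames are handled by repeating the first/last target."""
    h = fir_kernel(gamma, D)
    n = len(targets)
    out = []
    for k in range(n):
        acc = 0.0
        for j, w in enumerate(h):
            idx = min(max(k + j - D, 0), n - 1)  # clamp at sequence edges
            acc += w * targets[idx]
        out.append(acc)
    return out

# Hypothetical F1 (first VTR) targets: an /a/-like 700 Hz segment
# followed by an /i/-like 300 Hz segment.
long_seq = [700.0] * 30 + [300.0] * 30   # slow speech: long segments
short_seq = [700.0] * 5 + [300.0] * 5    # fast speech: short segments

long_out = filter_targets(long_seq)
short_out = filter_targets(short_seq)

# With long segments the trajectory reaches the 300 Hz target; with
# short segments it undershoots it, i.e. phonetic reduction emerges
# from the same filter without any rate-specific parameters.
print(min(long_out), min(short_out))
```

Because the kernel is normalized, the filtered trajectory converges to each target when segments are long, while shortening the segments alone produces the duration-dependent undershoot the abstract describes.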
