Pattern Recognition Letters

Hierarchical Bayesian combination of plug-in maximum a posteriori decoders in deep neural networks-based speech recognition and speaker adaptation



Abstract

We propose a novel decoding framework that dynamically combines K plug-in maximum a posteriori (MAP) decoders, each of which solves for a sequence of symbols state by state in time, subject to a set of constraints on the symbol sequences in space. Score combination occurs at the state level, with the K combination weights either set equal (an equal-weighting scheme) or learned from data in a hierarchical Bayesian setting. When applied to automatic speech recognition (ASR), the framework exploits characteristic differences between the acoustic probabilities computed by feed-forward deep neural networks (DNNs) and by Gaussian mixture models (GMMs) at the hidden Markov phone-state level, so that these scores can be discriminatively combined in plug-in MAP decoding. The DNN and GMM parameters can be trained on a large collection of speaker-independent (SI) speech data and further refined with a small set of speaker-adaptation (SA) utterances. The per-speaker, per-state combination weights can then be learned from the SA data through the proposed hierarchical Bayesian approach. Experimental results on the Switchboard ASR task show that even an ad hoc fixed-weight combination reduces the word error rate (WER) from an SI baseline of 17.4% to 16.9%. Model adaptation with 20 utterances reduces the WER to 16.7%, which drops further to 16.1% when the SA models are decoded with fixed-weight combination. The best WER of 15.3% is attained with the proposed hierarchical Bayesian learned weights combining the two SA and two SI systems. Finally, we contrast the proposed technique with a state-of-the-art static system combination approach based on minimum Bayes risk over multiple word lattices generated by the different ASR systems. The experimental results demonstrate that static system combination cannot improve on the individual systems, so the proposed dynamic combination scheme is needed. (C) 2017 Elsevier B.V. All rights reserved.
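The state-level score combination described in the abstract can be illustrated with a minimal sketch: for each HMM state, the combined log acoustic score is a weighted sum of the K systems' log scores, with weights either fixed and equal or learned per speaker and per state. The function name, array shapes, and random scores below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def combine_state_scores(log_scores, weights):
    """Linearly combine per-state log acoustic scores from K systems.

    log_scores: (K, S) array of log p_k(x_t | s) for K systems over S states.
    weights:    (K, S) array of per-state combination weights; equal
                weighting uses 1/K everywhere, while learned weights may
                differ per speaker and per state.
    Returns a length-S array of combined log scores, one per state.
    """
    return (weights * log_scores).sum(axis=0)

# Equal-weight combination of a DNN-like and a GMM-like score stream
# (synthetic scores; a real decoder would supply these per frame).
K, S = 2, 4
rng = np.random.default_rng(0)
log_scores = rng.normal(-5.0, 1.0, size=(K, S))
equal_w = np.full((K, S), 1.0 / K)
combined = combine_state_scores(log_scores, equal_w)
```

With equal weights this reduces to the per-state mean of the K log scores; the hierarchical Bayesian variant in the paper instead estimates the weight matrix from speaker-adaptation data.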
