Pattern Recognition Letters

Hierarchical Bayesian combination of plug-in maximum a posteriori decoders in deep neural networks-based speech recognition and speaker adaptation



Abstract

We propose a novel decoding framework that dynamically combines K plug-in maximum a posteriori (MAP) decoders, each of which solves for a sequence of symbols state by state in time, subject to a set of constraints on the symbol sequences in space. Score combination occurs at the state level, with the K combination weights either set equal (an equal-weighting scheme) or learned from data in a hierarchical Bayesian setting. When applied to automatic speech recognition (ASR), the framework exploits characteristic differences between the acoustic probabilities computed by feed-forward deep neural networks (DNNs) and by Gaussian mixture models (GMMs) at the hidden Markov phone-state level, so that these scores can be discriminatively combined in plug-in MAP decoding. The DNN and GMM parameters can be trained on a large collection of speaker-independent (SI) speech data and further refined with a small set of speaker-adaptation (SA) utterances. The per-speaker, per-state combination weights can then be learned from the SA data through the proposed hierarchical Bayesian approach. Experimental results on the Switchboard ASR task show that even an ad hoc fixed-weight combination reduces the word error rate (WER) from an SI baseline of 17.4% to 16.9%. Model adaptation with 20 utterances reduces the WER to 16.7%, which drops further to 16.1% when the SA models are decoded with fixed-weight combination. The best WER of 15.3% is attained with the proposed hierarchical Bayesian learned weights combining the two SA and two SI systems. Finally, we contrast the proposed technique with a state-of-the-art static system combination approach based on minimum Bayes risk over multiple word lattices generated by the different ASR systems. The experimental results demonstrate that static system combination cannot improve on the individual systems, so the proposed dynamic combination scheme is needed. (C) 2017 Elsevier B.V. All rights reserved.
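The state-level score combination described in the abstract can be illustrated with a minimal sketch: for each HMM state, the combined log acoustic score is a weighted sum of the K systems' log scores, with weights either fixed and equal or learned per speaker and per state. The function name, array shapes, and random scores below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def combine_state_scores(log_scores, weights):
    """Linearly combine per-state log acoustic scores from K systems.

    log_scores: (K, S) array of log p_k(x_t | s) for K systems over S states.
    weights:    (K, S) array of per-state combination weights; equal
                weighting uses 1/K everywhere, while learned weights may
                differ per speaker and per state.
    Returns a length-S array of combined log scores, one per state.
    """
    return (weights * log_scores).sum(axis=0)

# Equal-weight combination of a DNN-like and a GMM-like score stream
# (synthetic scores; a real decoder would supply these per frame).
K, S = 2, 4
rng = np.random.default_rng(0)
log_scores = rng.normal(-5.0, 1.0, size=(K, S))
equal_w = np.full((K, S), 1.0 / K)
combined = combine_state_scores(log_scores, equal_w)
```

With equal weights this reduces to the per-state mean of the K log scores; the hierarchical Bayesian variant in the paper instead estimates the weight matrix from speaker-adaptation data.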
