IEEE Journal of Selected Topics in Signal Processing

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition


Abstract

We propose an integrated end-to-end automatic speech recognition (ASR) paradigm based on joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that "only good signal processing can lead to top ASR performance" in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques: (i) a reverberation-time-aware, DNN-based speech dereverberation architecture that handles a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration, leveraging data acquired and processed with multichannel microphone arrays, to improve ASR performance. The final end-to-end system is established by jointly optimizing the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures of enhanced speech on the simulated test set that are superior to those listed in the 2014 REVERB Challenge Workshop. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. Leveraging joint training with more discriminative ASR features and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multichannel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we report preliminary yet promising experiments with the REVERB real test data.
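The core idea of the joint optimization step can be illustrated with a minimal sketch: a front-end enhancement DNN is composed with a back-end acoustic-model DNN, and a single cross-entropy loss on the senone targets is backpropagated through both networks. This is only an illustration of the training scheme described in the abstract; the layer sizes, names, and the use of PyTorch are assumptions, not the authors' actual configuration.

```python
# Hedged sketch of joint enhancement + acoustic-model training.
# All dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

FEAT_DIM, HIDDEN, N_SENONES = 40, 64, 10

# Front-end: maps reverberant log-spectral features to enhanced features.
enhancer = nn.Sequential(
    nn.Linear(FEAT_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, FEAT_DIM)
)
# Back-end: acoustic model predicting senone posteriors from enhanced features.
acoustic_model = nn.Sequential(
    nn.Linear(FEAT_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, N_SENONES)
)

# One optimizer over BOTH networks: this is what makes the training "joint".
params = list(enhancer.parameters()) + list(acoustic_model.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
ce_loss = nn.CrossEntropyLoss()

# Toy batch: random "reverberant" features with random senone targets.
x = torch.randn(32, FEAT_DIM)
y = torch.randint(0, N_SENONES, (32,))

losses = []
for _ in range(50):
    optimizer.zero_grad()
    logits = acoustic_model(enhancer(x))  # gradients flow through both DNNs
    loss = ce_loss(logits, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

In a realistic setup, the enhancer would first be pre-trained on (reverberant, clean) feature pairs and the acoustic model on enhanced features, with the joint fine-tuning above applied afterwards; the sketch omits those stages for brevity.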
