Computer Speech and Language

Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend


Abstract

This paper gives an in-depth presentation of the multi-microphone speech recognition system we submitted to the 3rd CHiME speech separation and recognition challenge (CHiME-3) and its extension. The proposed system takes advantage of recurrent neural networks (RNNs) throughout the pipeline, from front-end speech enhancement to language modeling. Three different types of beamforming are used to combine the multi-microphone signals into a single higher-quality signal. The beamformed signal is further processed by a single-channel long short-term memory (LSTM) enhancement network, which is used to extract stacked mel-frequency cepstral coefficient (MFCC) features. In addition, the beamformed signal is processed by two proposed noise-robust feature extraction methods. All features are used for decoding in speech recognition systems with deep neural network (DNN) based acoustic models and large-scale RNN language models to achieve high recognition accuracy in noisy environments. Our training methodology includes multi-channel noisy data training and speaker adaptive training, while at test time model combination is used to improve generalization. Results on the CHiME-3 benchmark show that the full set of techniques substantially reduces the word error rate (WER). Combining hypotheses from different beamforming and robust-feature systems ultimately achieved 5.05% WER on the real test data, an 84.7% reduction relative to the 32.99% WER baseline and a 44.5% reduction from our official CHiME-3 challenge result of 9.1% WER. Furthermore, this final result is better than the best result (5.8% WER) reported in the CHiME-3 challenge.
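The relative reductions quoted above follow directly from the reported WERs: (32.99 - 5.05) / 32.99 ≈ 84.7% and (9.1 - 5.05) / 9.1 ≈ 44.5%. The abstract does not specify which three beamformers were used, so the following is only a minimal delay-and-sum sketch in Python/NumPy of how multi-microphone signals can be time-aligned and averaged into a single higher-quality signal; the GCC-PHAT delay estimation against a reference microphone is an assumed, common choice rather than the authors' method.

import numpy as np

def gcc_phat_delay(sig, ref, fs, max_tau=0.01):
    # Estimate the delay (in samples) of `sig` relative to `ref` via GCC-PHAT.
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12              # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(max_tau * fs)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(np.abs(cc))) - max_shift

def delay_and_sum(channels, fs, ref_idx=0):
    # Align every channel to the reference microphone and average them.
    # Integer-sample alignment with a circular shift; a real system would use
    # fractional delays and a non-circular shift.
    ref = channels[ref_idx]
    aligned = [np.roll(ch, -gcc_phat_delay(ch, ref, fs)) for ch in channels]
    return np.mean(aligned, axis=0)

In this sketch, the beamformed output would then be passed to the single-channel LSTM enhancement network and the feature extractors described above.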