Journal of signal processing systems for signal, image, and video technology

A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

Abstract

We propose a novel speaker-dependent (SD) multi-condition (MC) training approach to the joint learning of deep neural network (DNN) acoustic models and an explicit speech separation structure for recognition of multi-talker mixed speech in a single-channel setting. First, an MC acoustic modeling framework is established to train an SD-DNN model in multi-talker scenarios. Assuming the speaker identities in the mixed speech are known, such a recognizer significantly reduces decoding complexity and improves recognition accuracy over recognizers that use speaker-independent DNN models with a complicated joint decoding structure. In addition, an SD regression DNN that maps the acoustic features of mixed speech to the speech features of a target speaker is jointly trained with the SD-DNN-based acoustic models. Experimental results on the Speech Separation Challenge (SSC) small-vocabulary recognition task show that the proposed approach under multi-condition training achieves an average word error rate (WER) of 3.8%, a relative WER reduction of 65.1% from the top-performing, DNN-based pre-processing-only approach we proposed earlier under clean-condition training (Tu et al. 2016). Furthermore, the proposed joint training DNN framework yields a relative WER reduction of 13.2% from state-of-the-art systems under multi-condition training. Finally, the effectiveness of the proposed approach is also verified on the Wall Street Journal (WSJ0) medium-vocabulary continuous speech recognition task in a simulated multi-talker setting.
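The abstract's central architectural idea, joint training of a speaker-dependent separation front-end together with the acoustic model, can be illustrated with a short sketch. The following is a minimal PyTorch sketch, not the authors' implementation: the class names SeparationDNN and AcousticDNN, all layer sizes, the feature dimension, the senone count, and the loss weight alpha are illustrative assumptions. It shows the cascade the abstract describes: a regression DNN maps mixed-speech features to estimated target-speaker features, a DNN acoustic model classifies those features into senones, and a combined regression-plus-recognition loss lets recognition gradients flow back into the separation network.

import torch
import torch.nn as nn

FEAT_DIM = 40      # assumed log-Mel filterbank dimension
NUM_SENONES = 256  # assumed tied-state inventory for a small-vocabulary task

class SeparationDNN(nn.Module):
    """Speaker-dependent regression DNN: mixed-speech features ->
    estimated clean features of the target speaker (hypothetical sizes)."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, feat_dim),  # linear output for regression
        )
    def forward(self, x):
        return self.net(x)

class AcousticDNN(nn.Module):
    """Speaker-dependent DNN acoustic model: features -> senone logits."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=1024, senones=NUM_SENONES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, senones),
        )
    def forward(self, x):
        return self.net(x)

sep, am = SeparationDNN(), AcousticDNN()
opt = torch.optim.SGD(list(sep.parameters()) + list(am.parameters()), lr=0.01)
mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()
alpha = 0.5  # assumed weight balancing separation and recognition losses

def joint_step(mixed, clean_target, senone_labels):
    """One joint update: the recognition loss back-propagates through the
    acoustic model into the separation front-end, so both are updated."""
    enhanced = sep(mixed)
    logits = am(enhanced)
    loss = ce(logits, senone_labels) + alpha * mse(enhanced, clean_target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy batch of 8 frames (a real system would splice context windows of frames).
loss = joint_step(torch.randn(8, FEAT_DIM),
                  torch.randn(8, FEAT_DIM),
                  torch.randint(0, NUM_SENONES, (8,)))

Whether the regression loss is kept during joint training (nonzero alpha) or used only for pre-training the front-end is a design choice the sketch leaves open; the weighted sum above is one common way to combine the two objectives.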
