IEEE/ACM Transactions on Audio, Speech, and Language Processing

A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks

Abstract

We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs) to directly model the highly nonlinear relationship between speech features of a mixed signal containing a target speaker and other interfering speakers. We focus our discussion on a semisupervised mode to separate speech of the target speaker from an unknown interfering speaker, which is more flexible than the conventional supervised mode with known information of both the target and interfering speakers. Two key issues are investigated. First, we propose a DNN architecture with dual outputs of the features of both the target and interfering speakers, which is shown to achieve a better generalization capability than that with output features of only the target speaker. Second, we propose using a set of multiple DNNs, each intended to be signal-noise-dependent (SND), to cope with the difficulty that one single general DNN could not accommodate well all the speaker mixing variabilities at different signal-to-noise ratio (SNR) levels. Experimental results on the speech separation challenge (SSC) data demonstrate that our proposed framework achieves better separation results than other conventional approaches in a supervised or semisupervised mode. SND-DNNs could also yield significant performance improvements over a general DNN for speech separation in low SNR cases. Furthermore, for automatic speech recognition (ASR) following speech separation, this purely front-end processing with a single set of speaker-independent ASR acoustic models achieves a relative word error rate (WER) reduction of 11.6% over a state-of-the-art separation and recognition system in which a complicated joint back-end decoding framework with multiple sets of speaker-dependent ASR acoustic models needs to be implemented. When speaker-adaptive ASR acoustic models for the target speakers are adopted for the enhanced signals, another 12.1% WER reduction over our best speaker-independent ASR system is achieved.
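The two key ideas in the abstract lend themselves to a compact illustration. Below is a minimal, hypothetical PyTorch sketch of a dual-output regression DNN (one output head per speaker) combined with signal-noise-dependent model selection; the layer sizes, feature dimensions, SNR ranges, and helper names are illustrative assumptions, not the authors' published configuration.

```python
# Hypothetical sketch only: dimensions, depths, and SNR ranges are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualOutputDNN(nn.Module):
    """Regression DNN mapping spliced spectral features of the mixture to
    feature estimates of BOTH the target and the interfering speaker."""
    def __init__(self, feat_dim=257, context=7, hidden=2048, num_layers=3):
        super().__init__()
        in_dim = feat_dim * context                     # context window of frames
        layers = []
        for i in range(num_layers):
            layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.Sigmoid()]
        self.trunk = nn.Sequential(*layers)
        self.target_head = nn.Linear(hidden, feat_dim)  # target-speaker features
        self.interf_head = nn.Linear(hidden, feat_dim)  # interfering-speaker features

    def forward(self, mixture_feats):
        h = self.trunk(mixture_feats)
        return self.target_head(h), self.interf_head(h)

def dual_mse_loss(pred_tgt, pred_int, ref_tgt, ref_int):
    # Plain MMSE regression on both speakers' reference features; the extra
    # interferer branch is what the abstract credits with better generalization.
    return F.mse_loss(pred_tgt, ref_tgt) + F.mse_loss(pred_int, ref_int)

# Signal-noise-dependent (SND) operation: one DNN per SNR range, with the
# model chosen by an SNR estimate of the test mixture (ranges assumed here).
snd_models = {(-9.0, -3.0): DualOutputDNN(),
              (-3.0, 3.0): DualOutputDNN(),
              (3.0, 9.0): DualOutputDNN()}

def separate(mixture_feats, estimated_snr_db):
    for (lo, hi), model in snd_models.items():
        if lo <= estimated_snr_db < hi:
            model.eval()
            with torch.no_grad():
                return model(mixture_feats)
    # outside all trained ranges: fall back to the model with the nearest range
    (lo, hi), model = min(snd_models.items(),
                          key=lambda kv: min(abs(estimated_snr_db - kv[0][0]),
                                             abs(estimated_snr_db - kv[0][1])))
    model.eval()
    with torch.no_grad():
        return model(mixture_feats)
```

In this reading, the target-feature estimate returned by separate() would feed waveform reconstruction and then a single speaker-independent ASR system, matching the purely front-end pipeline the abstract evaluates; training both heads jointly with dual_mse_loss is what makes this a regression approach rather than a mask-classification one.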