On Generalization of Supervised Speech Separation


Abstract

Speech is essential for human communication, as it not only delivers messages but also expresses emotions. In reality, speech is often corrupted by background noise and room reverberation. Perceiving speech in low signal-to-noise ratio (SNR) conditions is challenging, especially for hearing-impaired listeners. We are therefore motivated to develop speech separation algorithms that improve the intelligibility of noisy speech. Given its many applications, such as hearing aids and robust automatic speech recognition (ASR), speech separation has been an important problem in speech processing for decades.

Speech separation can be achieved by estimating the ideal binary mask (IBM) or the ideal ratio mask (IRM). In a time-frequency (T-F) representation of noisy speech, the IBM preserves speech-dominant T-F units and discards noise-dominant ones. Similarly, the IRM adjusts the gain of each T-F unit to suppress noise. Speech separation can thus be treated as a supervised learning problem in which the ideal mask is estimated from noisy speech. The three key components of supervised speech separation are learning machines, acoustic features and training targets. This supervised framework has enabled the treatment of speech separation with powerful learning machines such as deep neural networks (DNNs). For any supervised learning problem, generalization to unseen conditions is critical. This dissertation addresses the generalization of supervised speech separation.

We first explore acoustic features for supervised speech separation in low SNR conditions. An extensive list of acoustic features is evaluated for IBM estimation, including ASR features, speaker recognition features and speech separation features. In addition, we propose the multi-resolution cochleagram (MRCG) feature to incorporate both local information and broader spectrotemporal contexts. We find that gammatone-domain features, especially the proposed MRCG features, perform well for supervised speech separation at low SNRs.

Noise segment generalization is desired for noise-dependent speech separation. When tested on the same noise type, a learning machine needs to generalize to unseen noise segments. For nonstationary noises, there is a considerable mismatch between training and testing segments, which leads to poor performance during testing. We explore noise perturbation techniques to expand the training noise for better generalization. Experiments show that frequency perturbation effectively reduces false-alarm errors in mask estimation and improves objective metrics of speech intelligibility.

Speech separation in unseen environments requires generalization to unseen noise types, not just unseen noise segments. By exploring large-scale training, we find that a DNN-based IRM estimator trained on a large variety of noises generalizes well to unseen noises. Even for highly nonstationary noises, the noise-independent model performs comparably to noise-dependent models in terms of objective speech intelligibility measures. Further experiments with human subjects lead to the first demonstration that supervised speech separation improves speech intelligibility for hearing-impaired listeners in novel noises.

Besides noise generalization, speaker generalization is critical for the many applications in which target speech may be produced by an unseen speaker. We observe that training a DNN with many speakers leads to poor speaker generalization: the performance on seen speakers degrades as additional speakers are added for training. Such a DNN suffers from confusion between target speech and interfering speech fragments embedded in noise. We propose a model based on a recurrent neural network (RNN) with long short-term memory (LSTM) to incorporate the temporal dynamics of speech. We find that the trained LSTM keeps track of a target speaker and substantially improves speaker generalization over the DNN. Experiments show that the proposed model generalizes to unseen noises, unseen SNRs and unseen speakers.
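To make the two training targets concrete, the following is a minimal sketch of the IBM and IRM as commonly defined in this literature, assuming parallel magnitude spectrograms of the premixed clean speech and noise are available during training. The local criterion `lc_db` and the exponent `beta` are typical values from related work, not necessarily the dissertation's settings, and the dissertation defines masks over a gammatone-style T-F representation; the formulas carry over.

```python
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, lc_db=-5.0):
    """IBM: 1 for T-F units whose local SNR exceeds a local criterion
    (LC, in dB), 0 otherwise. lc_db = -5 is a typical choice."""
    eps = 1e-8
    local_snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    return (local_snr_db > lc_db).astype(np.float32)

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """IRM: a soft gain per T-F unit based on the speech-to-mixture
    energy ratio; beta = 0.5 is a common exponent."""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-8)) ** beta

# Separation applies the estimated mask as an element-wise gain:
#   enhanced_mag = mask * mixture_mag
```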
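The noise perturbation idea can likewise be sketched. Below, each T-F unit of a noise magnitude spectrogram is displaced along the frequency axis by a smoothed random offset and resampled by linear interpolation, yielding new noise material for training. This is a hedged reconstruction of frequency perturbation; the shift range and smoothing length are illustrative assumptions, not the dissertation's exact procedure.

```python
import numpy as np

def perturb_noise_frequency(noise_mag, max_shift=4.0, smooth=11, rng=None):
    """Frequency perturbation sketch over a (frames, bins) noise
    magnitude spectrogram; returns a perturbed spectrogram."""
    rng = np.random.default_rng() if rng is None else rng
    n_frames, n_bins = noise_mag.shape
    # Random per-unit shifts (in bins), smoothed over time and frequency
    # so that neighboring units move coherently.
    shifts = rng.uniform(-max_shift, max_shift, size=(n_frames, n_bins))
    kernel = np.ones(smooth) / smooth
    shifts = np.apply_along_axis(np.convolve, 0, shifts, kernel, mode="same")
    shifts = np.apply_along_axis(np.convolve, 1, shifts, kernel, mode="same")
    bins = np.arange(n_bins)
    perturbed = np.empty_like(noise_mag)
    for t in range(n_frames):
        # Resample frame t at the displaced frequency positions.
        perturbed[t] = np.interp(bins + shifts[t], bins, noise_mag[t])
    return perturbed
```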

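Finally, a minimal sketch of an LSTM-based mask estimator of the kind the last paragraph describes: stacked LSTM layers map a sequence of acoustic feature frames to per-frame IRM estimates. All layer sizes here are illustrative assumptions rather than the dissertation's configuration.

```python
import torch
import torch.nn as nn

class LstmIrmEstimator(nn.Module):
    """Recurrent mask estimator: feature frames in, IRM estimates out."""
    def __init__(self, feat_dim=64, hidden_dim=512, num_layers=2, n_bins=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, n_bins)

    def forward(self, feats):                  # (batch, frames, feat_dim)
        h, _ = self.lstm(feats)                # (batch, frames, hidden_dim)
        return torch.sigmoid(self.proj(h))     # IRM estimates in [0, 1]

# Training would minimize, e.g., MSE between estimated and ideal ratio
# masks over (noisy feature, IRM) pairs.
```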
Bibliographic record

  • Author

    Chen, Jitong.

  • Affiliation

    The Ohio State University.

  • Degree grantor: The Ohio State University.
  • Subjects: Engineering; Computer science
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 142 p.
  • Total pages: 142
  • Format: PDF
  • Language: English

