Conference: IEEE Workshop on Automatic Speech Recognition and Understanding

Cracking the cocktail party problem by multi-beam deep attractor network

Abstract

Although recent progress in neural network approaches to single-channel speech separation, or more generally the cocktail party problem, has yielded significant improvements, performance on complex mixtures is still not satisfactory. In this work, we propose a novel multi-channel framework for multi-talker separation. In the proposed model, an input multi-channel mixture signal is first converted to a set of beamformed signals using fixed beam patterns. For this beamforming, we propose to use differential beamformers, as they are more suitable for speech separation. Each beamformed signal is then fed into a single-channel anchored deep attractor network to generate separated signals. The final separation is obtained by post-selecting the separated output for each beam. To evaluate the proposed system, we create a challenging dataset comprising mixtures of 2, 3 or 4 speakers. Our results show that the proposed system substantially improves the state of the art in speech separation, achieving 11.5 dB, 11.76 dB and 11.02 dB average signal-to-distortion ratio improvement for 4-, 3- and 2-speaker overlapped mixtures, respectively, which is comparable to the performance of a minimum variance distortionless response beamformer that uses oracle location, source, and noise information. We also run speech recognition with a clean-trained acoustic model on the separated speech, achieving relative word error rate (WER) reductions of 45.76%, 59.40% and 62.80% on fully overlapped speech of 4, 3 and 2 speakers, respectively. With a far-talk acoustic model, the WER is further reduced.
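The two figures of merit quoted in the abstract, signal-to-distortion ratio (SDR) improvement and relative WER reduction, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, the SDR here is a simplified energy ratio (standard toolkits typically use a more elaborate decomposition), and the toy signals and WER values are made up purely for illustration.

```python
import math

def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10*log10(||s||^2 / ||s - s_hat||^2).
    Assumption: a plain energy ratio, not a full BSS-style evaluation."""
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)

def sdr_improvement_db(reference, mixture, estimate):
    """SDR gain of the separated estimate over the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)

def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent, measured against the baseline WER."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# Hypothetical toy data: a clean source, a noisy mixture, and a separated estimate.
reference = [1.0, -1.0, 1.0, -1.0]
mixture = [1.5, -0.5, 1.5, -0.5]
estimate = [1.1, -0.9, 1.1, -0.9]

print(round(sdr_improvement_db(reference, mixture, estimate), 2))
print(relative_wer_reduction(50.0, 20.0))
```

Under these definitions, an SDR improvement is a difference in dB between the separated output and the mixture, while a relative WER reduction is a percentage of the baseline error rate, so the two numbers are not directly comparable.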
