IEEE/ACM Transactions on Audio, Speech, and Language Processing

Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks

Abstract

In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker-independent multitalker speech separation. Specifically, uPIT extends the recently proposed permutation invariant training (PIT) technique with an utterance-level cost function, hence eliminating the need to solve an additional permutation problem during inference, which is otherwise required by frame-level PIT. We achieve this using recurrent neural networks (RNNs) that, during training, minimize the utterance-level separation error, hence forcing separated frames belonging to the same speaker to be aligned to the same output stream. In practice, this allows RNNs trained with uPIT to separate multitalker mixed speech without any prior knowledge of signal duration, number of speakers, speaker identity, or gender. We evaluated uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that uPIT outperforms techniques based on nonnegative matrix factorization and computational auditory scene analysis, and compares favorably with deep clustering and the deep attractor network. Furthermore, we found that models trained with uPIT generalize well to unseen speakers and languages. Finally, we found that a single model trained with uPIT can handle both two-speaker and three-speaker speech mixtures.
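The core of uPIT is easy to state in code: for S output streams and S reference streams, compute the separation error over the whole utterance for every output-to-speaker permutation, and train on the minimum. Below is a minimal NumPy sketch of that criterion; the array shapes, the plain MSE objective on magnitude spectrograms, and the name `upit_loss` are illustrative assumptions rather than the paper's exact formulation (which estimates masks applied to the mixture spectrum).

```python
from itertools import permutations

import numpy as np


def upit_loss(estimates, targets):
    """Minimum utterance-level MSE over all output-to-speaker permutations.

    estimates, targets: arrays of shape (S, T, F) -- S streams, each a
    (frames x frequency-bins) magnitude spectrogram for one whole utterance.
    Returns the best loss and the permutation that achieved it.
    """
    num_streams = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    # Score each candidate assignment once over the WHOLE utterance. This is
    # the utterance-level twist: every frame of an output stream is committed
    # to the same speaker, so no per-frame reassignment is needed at inference.
    for perm in permutations(range(num_streams)):
        loss = float(np.mean((estimates[list(perm)] - targets) ** 2))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy usage: a two-talker utterance, 100 frames, 129 frequency bins.
rng = np.random.default_rng(0)
est = rng.standard_normal((2, 100, 129))
tgt = rng.standard_normal((2, 100, 129))
loss, perm = upit_loss(est, tgt)
print(f"uPIT loss {loss:.4f}, assignment {perm}")
```

Because the assignment is scored once per utterance rather than per frame, the winning permutation is the same for every frame, which is what removes the inference-time permutation problem that frame-level PIT leaves behind. The S! permutations scanned here are cheap for the two- and three-talker settings the abstract evaluates.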