Published in: IEEE Journal of Selected Topics in Signal Processing

SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures

Abstract

Processing speech corrupted by interfering, overlapping speakers remains a challenging problem for today's automatic speech recognition systems. Recently, approaches based on deep learning have made great progress toward solving this problem. Most of these approaches treat the problem as speech separation, i.e., they blindly recover all the speakers from the mixture. In some scenarios, however, such as smart personal devices, we may be interested in recovering only one target speaker from a mixture. In this paper, we introduce SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker. Formulating the problem as speaker extraction avoids certain issues such as label permutation and the need to determine the number of speakers in the mixture. With SpeakerBeam, we jointly learn to extract a representation from the adaptation utterance characterizing the target speaker and to use this representation to extract the speaker. We explore several ways to do this, mostly inspired by speaker adaptation in acoustic models for automatic speech recognition. We evaluate the performance on the widely used WSJ0-2mix and WSJ0-3mix datasets, and on versions of these datasets modified with more noise or more realistic overlapping patterns. We further analyze the learned behavior by exploring the speaker representations and assessing the effect of the length of the adaptation data. The results show the benefit of including speaker information in the processing and the effectiveness of the proposed method.
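The data flow described in the abstract (a representation computed from the adaptation utterance, then used to steer extraction from the mixture) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the layer sizes, the random untrained weights, the time-averaging of the auxiliary network output, and in particular the choice to scale hidden activations by the speaker representation, which is only one of several possible ways to inject speaker information. The function names are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; stand-ins for trained parameters (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def speaker_embedding(adapt_frames, W_aux):
    """Hypothetical auxiliary network: project each frame, apply ReLU,
    then average over time to get one vector per adaptation utterance."""
    projected = [[max(0.0, h) for h in matvec(W_aux, f)] for f in adapt_frames]
    n_frames = len(projected)
    return [sum(col) / n_frames for col in zip(*projected)]


def extract_target(mix_frames, embedding, W_in, W_out):
    """Hypothetical main network: hidden activations are scaled element-wise
    by the speaker embedding, then mapped to a mask in (0, 1) that is
    applied to the mixture magnitudes."""
    estimate = []
    for frame in mix_frames:
        hidden = [max(0.0, h) for h in matvec(W_in, frame)]
        adapted = [h * e for h, e in zip(hidden, embedding)]
        mask = [1.0 / (1.0 + math.exp(-z)) for z in matvec(W_out, adapted)]
        estimate.append([m * x for m, x in zip(mask, frame)])
    return estimate


rng = random.Random(0)
N_FREQ, N_HID = 8, 16  # toy sizes, chosen arbitrarily
W_aux = rand_matrix(N_HID, N_FREQ, rng)
W_in = rand_matrix(N_HID, N_FREQ, rng)
W_out = rand_matrix(N_FREQ, N_HID, rng)

# Random non-negative "magnitude spectrogram" frames standing in for real audio.
mixture = [[abs(rng.gauss(0.0, 1.0)) for _ in range(N_FREQ)] for _ in range(5)]
adaptation = [[abs(rng.gauss(0.0, 1.0)) for _ in range(N_FREQ)] for _ in range(12)]

emb = speaker_embedding(adaptation, W_aux)
estimate = extract_target(mixture, emb, W_in, W_out)
```

The sketch produces a single output stream however many speakers the mixture contains, which is why an extraction formulation needs neither a speaker count nor a permutation-invariant assignment of outputs to speakers. The adaptation utterance may also have a different length from the mixture, since it is reduced to one fixed-size vector.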