Cracking the cocktail party problem by multi-beam deep attractor network

Abstract

While recent progress in neural network approaches to single-channel speech separation, or more generally the cocktail party problem, has brought significant improvement, performance on complex mixtures is still not satisfactory. In this work, we propose a novel multi-channel framework for multi-talker separation. In the proposed model, an input multi-channel mixture signal is first converted to a set of beamformed signals using fixed beam patterns. For this beamforming, we propose to use differential beamformers, as they are more suitable for speech separation. Each beamformed signal is then fed into a single-channel anchored deep attractor network to generate separated signals. The final separation is obtained by post-selecting the separated outputs of each beam. To evaluate the proposed system, we create a challenging dataset comprising mixtures of 2, 3 or 4 speakers. Our results show that the proposed system largely improves the state of the art in speech separation, achieving 11.5 dB, 11.76 dB and 11.02 dB average signal-to-distortion ratio improvement for overlapped mixtures of 4, 3 and 2 speakers, respectively, which is comparable to the performance of a minimum variance distortionless response beamformer that uses oracle location, source, and noise information. We also run speech recognition with a clean-trained acoustic model on the separated speech, achieving relative word error rate (WER) reductions of 45.76%, 59.40% and 62.80% on fully overlapped speech of 4, 3 and 2 speakers, respectively. With a far-talk acoustic model, the WER is further reduced.
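
The fixed-beam front end and the two reported metrics can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function names, the array shapes, the generic filter-and-sum form of the beamformer, and the simplified energy-ratio definition of signal-to-distortion ratio. The abstract does not give the differential beamformer weights, the anchored deep attractor network, or the post-selection rule, so none of those are reproduced, and this is not the authors' implementation.

```python
import numpy as np


def apply_fixed_beams(mix_stft, beam_weights):
    """Apply a bank of fixed beamformers to a multi-channel STFT.

    mix_stft:     complex array, shape (channels, frames, freqs)
    beam_weights: complex array, shape (beams, channels, freqs)
                  (hypothetical precomputed weights, one set per beam)
    Returns a complex array of shape (beams, frames, freqs), where
    y[b, t, f] = sum over c of conj(w[b, c, f]) * x[c, t, f].
    """
    return np.einsum("bcf,ctf->btf", np.conj(beam_weights), mix_stft)


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: a plain energy ratio, not a full BSS-eval decomposition.
    """
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))


def sdr_improvement_db(reference, estimate, mixture):
    """SDR of the separated estimate minus SDR of the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


def relative_wer_reduction(wer_baseline, wer_system):
    """Relative WER reduction in percent: 100 * (baseline - system) / baseline."""
    return 100.0 * (wer_baseline - wer_system) / wer_baseline
```

In a pipeline of the kind the abstract describes, each slice `y[b]` returned by `apply_fixed_beams` would be passed to a single-channel separation model, and the per-beam outputs would then go through a selection step; both stages are left out because their details are not stated here.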

Bibliographic record

Source
2017 IEEE Automatic Speech Recognition and Understanding Workshop | 2017 | pp. 437-444 | 8 pages
Conference location: Okinawa (JP)
Authors
Zhuo Chen; Jinyu Li; Xiong Xiao; Takuya Yoshioka; Huaming Wang; Zhenghao Wang; Yifan Gong;
Author affiliations

Microsoft AI and Research (all seven authors)
Original format: PDF
Text language: English
Keywords
Speech; Microphones; Neural networks; Acoustic beams; Speech recognition; Time-frequency analysis; Machine learning

