MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition


Abstract

Recently, the end-to-end approach has proven its efficacy in monaural multi-speaker speech recognition. However, high word error rates (WERs) still prevent these systems from being used in practical applications. On the other hand, the spatial information in multi-channel signals has proven helpful in far-field speech recognition tasks. In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition. MIMO-Speech is a fully neural end-to-end framework, which is optimized only via an ASR criterion. It is comprised of: 1) a monaural masking network, 2) a multi-source neural beamformer, and 3) a multi-output speech recognition model. With this processing, the input overlapped speech is directly mapped to text sequences. We further adopted a curriculum learning strategy, making the best use of the training set to improve the performance. The experiments on the spatialized wsj1-2mix corpus show that our model can achieve more than 60% WER reduction compared to the single-channel system with high quality enhanced signals (SI-SDR = 23.1 dB) obtained by the above separation function.
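
The abstract reports separation quality as SI-SDR (scale-invariant signal-to-distortion ratio). The following is a minimal sketch of how that metric is commonly computed for a single-channel estimate against a reference signal. It is not taken from the paper: the function name `si_sdr`, the `eps` stabilizer, and the mean removal step are assumptions made for illustration, and the test signals are synthetic.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB (illustrative helper, not the authors' code)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so a DC offset does not affect the score (assumed convention).
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything left over after the projection is counted as distortion.
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)


# Synthetic example: a sine reference plus an orthogonal cosine at 0.1 amplitude.
n = np.arange(16000)
reference = np.sin(2 * np.pi * 100 * n / 16000)
interference = np.cos(2 * np.pi * 100 * n / 16000)
estimate = reference + 0.1 * interference
print(si_sdr(estimate, reference))        # close to 20 dB (amplitude ratio of 10)
print(si_sdr(3.0 * estimate, reference))  # rescaling the estimate leaves the score unchanged
```

Because the target is obtained by projection, multiplying the estimate by any nonzero gain leaves the score unchanged, which is why the metric suits beamformer outputs whose absolute level is arbitrary.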


Bibliographic details

Source: IEEE Automatic Speech Recognition and Understanding Workshop, 2019, pp. 237-244 (8 pages)
Authors: Xuankai Chang; Wangyou Zhang; Yanmin Qian; Jonathan Le Roux; Shinji Watanabe
Original format: PDF
Keywords: Speech recognition; Hidden Markov models; Training; Speech processing; Neural networks; Array signal processing; Task analysis


Similar literature


1. Bridging automatic speech recognition and psycholinguistics: Extending Shortlist to an end-to-end model of human speech recognition (L) [J]. Odette Scharenborg, Louis ten Bosch, Lou Boves. The Journal of the Acoustical Society of America. 2003, No. 6
2. Advances in Multi-speaker Conversational Speech Recognition and Understanding [J]. Takaaki Hori, Shoko Araki, Tomohiro Nakatani. NTT Technical Review. 2013, No. 12
3. Representation transfer learning from deep end-to-end speech recognition networks for the classification of health states from speech [J]. Benjamin Sertolli, Zhao Ren, Bjoern W. Schuller. Computer Speech and Language. 2021, No. Jula
4. MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition [C]. Xuankai Chang, Wangyou Zhang, Yanmin Qian. IEEE Automatic Speech Recognition and Understanding Workshop. 2019
5. End-to-End Speech Recognition on Conversations [D]. Kim, Suyoun. 2019
6. Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition [O]. Aleksandr Laptev, Andrei Andrusenko, Ivan Podluzhny. 2021
7. A Multi-Channel/Multi-Speaker Articulatory Database for Continuous Speech Recognition Research [O]. Wrench, Alan A. 2000
