IEEE International Conference on Acoustics, Speech and Signal Processing

Improving Streaming Automatic Speech Recognition with Non-Streaming Model Distillation on Unsupervised Data



Abstract

Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, unlike their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a novel and effective learning method that leverages a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which is then used to distill knowledge into streaming ASR models. This way, we scale the training of streaming models to up to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the word error rate (WER) of RNN-T models not only on LibriSpeech but also on YouTube data in four languages. For example, in French, we are able to reduce the WER by 16.4% relative to a baseline streaming model by leveraging a non-streaming teacher model trained on the same amount of labeled data as the baseline.
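The approach described in the abstract amounts to large-scale pseudo-labelling: a full-context teacher transcribes unlabelled audio, and a causal streaming student (an RNN-T) is trained on those transcripts. The sketch below illustrates that two-stage pipeline; the `NonStreamingTeacher` and `StreamingStudent` classes, their methods, and the file names are hypothetical stand-ins for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Utterance:
    audio_path: str          # path to one unlabelled audio clip
    pseudo_label: str = ""   # transcript produced by the teacher


class NonStreamingTeacher:
    """Stand-in for a full-context ASR model that sees the whole utterance."""

    def transcribe(self, audio_path: str) -> str:
        # A real teacher would run full-context decoding over the entire clip.
        return f"<teacher transcript for {audio_path}>"


class StreamingStudent:
    """Stand-in for a causal streaming model (e.g. an RNN-T)."""

    def train_step(self, audio_path: str, target: str) -> float:
        # A real student would compute an RNN-T loss on the pseudo-label
        # and update its weights; here it is a no-op placeholder.
        return 0.0


def generate_pseudo_labels(teacher: NonStreamingTeacher,
                           unlabeled_audio: Iterable[str]) -> List[Utterance]:
    """Teacher pass: transcribe an arbitrarily large unlabelled corpus."""
    return [Utterance(path, teacher.transcribe(path)) for path in unlabeled_audio]


def distill(student: StreamingStudent, corpus: List[Utterance], epochs: int = 1) -> None:
    """Student pass: train the streaming model on the teacher's transcripts."""
    for _ in range(epochs):
        for utt in corpus:
            student.train_step(utt.audio_path, utt.pseudo_label)


if __name__ == "__main__":
    unlabeled_audio = ["clip_000.wav", "clip_001.wav"]  # stands in for YouTube-scale data
    corpus = generate_pseudo_labels(NonStreamingTeacher(), unlabeled_audio)
    distill(StreamingStudent(), corpus)
```

Because the teacher's transcripts are plain hard labels, the student's training loop is unchanged from supervised training; only the source of the targets differs, which is what lets the method scale to millions of hours of unlabelled audio.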
