IEEE International Conference on Acoustics, Speech and Signal Processing

Improving Streaming Automatic Speech Recognition with Non-Streaming Model Distillation on Unsupervised Data



Abstract

Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, unlike their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a novel and effective learning method that leverages a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which is then used to distill knowledge into streaming ASR models. This way, we scale the training of streaming models to up to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the word error rate (WER) of RNN-T models not only on LibriSpeech but also on YouTube data in four languages. For example, in French, we are able to reduce the WER by 16.4% relative to a baseline streaming model by leveraging a non-streaming teacher model trained on the same amount of labeled data as the baseline.
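The approach described in the abstract amounts to large-scale pseudo-labelling: a full-context teacher transcribes unlabelled audio, and a causal streaming student (an RNN-T) is trained on those transcripts. The sketch below illustrates that two-stage pipeline; the `NonStreamingTeacher` and `StreamingStudent` classes, their methods, and the file names are hypothetical stand-ins for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Utterance:
    audio_path: str          # path to one unlabelled audio clip
    pseudo_label: str = ""   # transcript produced by the teacher


class NonStreamingTeacher:
    """Stand-in for a full-context ASR model that sees the whole utterance."""

    def transcribe(self, audio_path: str) -> str:
        # A real teacher would run full-context decoding over the entire clip.
        return f"<teacher transcript for {audio_path}>"


class StreamingStudent:
    """Stand-in for a causal streaming model (e.g. an RNN-T)."""

    def train_step(self, audio_path: str, target: str) -> float:
        # A real student would compute an RNN-T loss on the pseudo-label
        # and update its weights; here it is a no-op placeholder.
        return 0.0


def generate_pseudo_labels(teacher: NonStreamingTeacher,
                           unlabeled_audio: Iterable[str]) -> List[Utterance]:
    """Teacher pass: transcribe an arbitrarily large unlabelled corpus."""
    return [Utterance(path, teacher.transcribe(path)) for path in unlabeled_audio]


def distill(student: StreamingStudent, corpus: List[Utterance], epochs: int = 1) -> None:
    """Student pass: train the streaming model on the teacher's transcripts."""
    for _ in range(epochs):
        for utt in corpus:
            student.train_step(utt.audio_path, utt.pseudo_label)


if __name__ == "__main__":
    unlabeled_audio = ["clip_000.wav", "clip_001.wav"]  # stands in for YouTube-scale data
    corpus = generate_pseudo_labels(NonStreamingTeacher(), unlabeled_audio)
    distill(StreamingStudent(), corpus)
```

Because the teacher's transcripts are plain hard labels, the student's training loop is unchanged from supervised training; only the source of the targets differs, which is what lets the method scale to millions of hours of unlabelled audio.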
