Conference: Spoken Language Technology Workshop

Cascade RNN-Transducer: Syllable Based Streaming On-Device Mandarin Speech Recognition with a Syllable-To-Character Converter


Abstract

End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, the recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high accuracy and low latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still needs paired speech-text data to train. Further strengthening the language modeling ability through extra text data, such as shallow fusion with an external language model, brings only a small performance gain. Given that Mandarin Chinese is a character-based language in which each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our approach first uses an RNN-T to transform acoustic features into a syllable sequence, and then converts the syllable sequence into a character sequence through an RNN-T-based syllable-to-character converter. Thus a rich text repository can easily be used to strengthen the language modeling ability. By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency.
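
The two-stage data flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: in the paper both stages are RNN-T models, whereas here `toy_stage1` (a per-frame argmax over a three-syllable inventory) and `toy_stage2` (a greedy longest-match lexicon lookup) are hypothetical stand-ins, and the names `cascade_decode`, `SYLLABLES`, and `LEXICON` are assumptions made for illustration only. The sketch shows only that the second stage consumes syllables rather than audio, which is what lets text-derived data be used for it.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Frame = Sequence[float]


def cascade_decode(
    frames: Sequence[Frame],
    stage1: Callable[[Sequence[Frame]], List[str]],
    stage2: Callable[[List[str]], str],
) -> str:
    """Run acoustic frames through stage 1 (syllables), then stage 2 (characters)."""
    syllables = stage1(frames)
    return stage2(syllables)


# Hypothetical toy inventory of tonal syllables (pinyin with tone digits).
SYLLABLES: List[str] = ["ni3", "hao3", "ma5"]


def toy_stage1(frames: Sequence[Frame]) -> List[str]:
    """Stand-in for the acoustic-to-syllable RNN-T: pick the argmax syllable per frame."""
    return [SYLLABLES[max(range(len(f)), key=lambda i: f[i])] for f in frames]


# Hypothetical toy lexicon mapping syllable n-grams to characters.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("ni3", "hao3"): "你好",
    ("ni3",): "你",
    ("hao3",): "好",
    ("ma5",): "吗",
}


def toy_stage2(syllables: List[str]) -> str:
    """Stand-in for the syllable-to-character converter: greedy longest match."""
    out: List[str] = []
    i = 0
    while i < len(syllables):
        # Try the longest remaining span first, then shrink.
        for n in range(len(syllables) - i, 0, -1):
            key = tuple(syllables[i : i + n])
            if key in LEXICON:
                out.append(LEXICON[key])
                i += n
                break
        else:
            out.append("?")  # unknown syllable in this toy lexicon
            i += 1
    return "".join(out)


# Three fake "frames", each a score vector over SYLLABLES.
frames = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
result = cascade_decode(frames, toy_stage1, toy_stage2)
print(result)
```

Because `toy_stage2` takes only a syllable list as input, it can be built or replaced without touching audio at all. A real converter would also need sentence context to choose among homophones, which a lookup table like this one does not model.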
