IEEE International Conference on Acoustics, Speech and Signal Processing

Simpleflat: A Simple Whole-Network Pre-Training Approach for RNN Transducer-Based End-to-End Speech Recognition



Abstract

The recurrent neural network transducer (RNN-T) is promising for building time-synchronous end-to-end automatic speech recognition (ASR) systems, in part because it does not need frame-wise alignment between input features and target labels during training. Although training without alignment is beneficial, it makes it difficult to discern the relation between input features and output token sequences, which in effect degrades RNN-T performance. Our solution is SimpleFlat (SF), a novel and simple whole-network pre-training approach for RNN-T. SF extracts frame-wise alignments on the fly from the training dataset and does not require any external resources. We distribute the target tokens evenly across frames, matching the RNN-T encoder output length, by repeating each token. The frame-wise tokens so created are shifted and also used as the prediction network inputs. SF can therefore be implemented with a cross-entropy loss computation, as in autoregressive model training. Experiments on Japanese and English ASR tasks demonstrate that SF effectively improves various RNN-T architectures.
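
As a rough illustration of the procedure the abstract describes, the following Python sketch distributes target tokens evenly across encoder output frames and builds the shifted prediction network inputs. This is a minimal reading of the abstract, not the authors' implementation: the even-split rule, the function name flat_align, and the "<blank>" start symbol are assumptions introduced for illustration.

# Minimal sketch of SimpleFlat-style frame-wise token distribution.
# Assumptions (not from the paper): even split across frames,
# flat_align as the helper name, "<blank>" as the start symbol.

def flat_align(tokens, num_frames):
    """Repeat each target token so the label sequence spans all frames."""
    n = len(tokens)
    assert num_frames >= n, "need at least one frame per token"
    labels = []
    for u, tok in enumerate(tokens):
        # Token u covers frames [u*T/n, (u+1)*T/n): an even split.
        start = (u * num_frames) // n
        end = ((u + 1) * num_frames) // n
        labels.extend([tok] * (end - start))
    return labels

tokens = ["h", "e", "ll", "o"]        # hypothetical subword targets
T = 10                                # encoder output length
labels = flat_align(tokens, T)        # frame-wise cross-entropy targets
# Shift right by one step to obtain the prediction network inputs,
# as in teacher-forced autoregressive training.
pred_inputs = ["<blank>"] + labels[:-1]
print(labels)       # ['h', 'h', 'e', 'e', 'e', 'll', 'll', 'o', 'o', 'o']
print(pred_inputs)  # ['<blank>', 'h', 'h', 'e', 'e', 'e', 'll', 'll', 'o', 'o']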