IEEE International Conference on Acoustics, Speech and Signal Processing

Simpleflat: A Simple Whole-Network Pre-Training Approach for RNN Transducer-Based End-to-End Speech Recognition



Abstract

The recurrent neural network transducer (RNN-T) is promising for building time-synchronous end-to-end automatic speech recognition (ASR) systems, in part because it does not need frame-wise alignment between input features and target labels during training. Although training without alignment is beneficial, it makes it difficult to discern the relation between input features and output token sequences, which in effect degrades RNN-T performance. Our solution is SimpleFlat (SF), a novel and simple whole-network pre-training approach for RNN-T. SF extracts frame-wise alignments on the fly from the training dataset and does not require any external resources. We distribute the target tokens evenly across frames, matching the RNN-T encoder output length, by repeating each token. The frame-wise tokens so created are shifted and also used as the prediction network inputs. SF can therefore be implemented with a cross-entropy loss computation, as in autoregressive model training. Experiments on Japanese and English ASR tasks demonstrate that SF effectively improves various RNN-T architectures.
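
As a rough illustration of the procedure the abstract describes, the following Python sketch distributes target tokens evenly across encoder output frames and builds the shifted prediction network inputs. This is a minimal reading of the abstract, not the authors' implementation: the even-split rule, the function name flat_align, and the "<blank>" start symbol are assumptions introduced for illustration.

# Minimal sketch of SimpleFlat-style frame-wise token distribution.
# Assumptions (not from the paper): even split across frames,
# flat_align as the helper name, "<blank>" as the start symbol.

def flat_align(tokens, num_frames):
    """Repeat each target token so the label sequence spans all frames."""
    n = len(tokens)
    assert num_frames >= n, "need at least one frame per token"
    labels = []
    for u, tok in enumerate(tokens):
        # Token u covers frames [u*T/n, (u+1)*T/n): an even split.
        start = (u * num_frames) // n
        end = ((u + 1) * num_frames) // n
        labels.extend([tok] * (end - start))
    return labels

tokens = ["h", "e", "ll", "o"]        # hypothetical subword targets
T = 10                                # encoder output length
labels = flat_align(tokens, T)        # frame-wise cross-entropy targets
# Shift right by one step to obtain the prediction network inputs,
# as in teacher-forced autoregressive training.
pred_inputs = ["<blank>"] + labels[:-1]
print(labels)       # ['h', 'h', 'e', 'e', 'e', 'll', 'll', 'o', 'o', 'o']
print(pred_inputs)  # ['<blank>', 'h', 'h', 'e', 'e', 'e', 'll', 'll', 'o', 'o']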