Multi-speaker Sequence-to-sequence Speech Synthesis for Data Augmentation in Acoustic-to-word Speech Recognition

Abstract

Acoustic-to-word (A2W) automatic speech recognition (ASR) realizes very fast decoding with a simple architecture and achieves state-of-the-art performance. However, the A2W model suffers from the out-of-vocabulary (OOV) word problem and cannot use text-only data to improve its language modeling capability. Meanwhile, sequence-to-sequence neural speech synthesis has also been developed and has achieved naturalness comparable to human speech. We investigate leveraging sequence-to-sequence neural speech synthesis to augment training data for the ASR system in a target domain. While a speech synthesis model is usually trained with single-speaker data, ASR needs to cover a variety of speakers. In this work, we extend the speech synthesizer so that it can output the speech of many speakers. The multi-speaker speech synthesizer is trained with a large corpus in the source domain and then used to generate acoustic features from texts of the target domain. These synthesized speech features are combined with real speech features of the source domain to train an attention-based A2W model. Experimental results show that the A2W model trained with the multi-speaker model achieved a significant improvement over the baseline and the single-speaker model.

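The augmentation step described in the abstract (synthesize acoustic features from target-domain text with varied speakers, then mix them with real source-domain features) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `Utterance`, `toy_synthesize`, and `build_training_set` are hypothetical names, the toy synthesizer stands in for a trained multi-speaker sequence-to-sequence model, and uniform random speaker selection is an assumption that the abstract does not specify.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

Features = List[List[float]]  # frames x feature dimensions


@dataclass
class Utterance:
    features: Features  # acoustic features (real or synthesized)
    words: List[str]    # word-level targets for the A2W model
    synthetic: bool     # True if the features came from the synthesizer


def toy_synthesize(text: str, speaker_id: int) -> Features:
    """Hypothetical stand-in for a multi-speaker speech synthesizer.

    Emits one 2-dimensional frame per character. A real model would
    output acoustic features conditioned on a speaker representation.
    """
    return [[float(ord(ch) % 7), float(speaker_id)] for ch in text]


def build_training_set(
    real_source: Sequence[Utterance],
    target_texts: Sequence[str],
    num_speakers: int,
    synthesize: Callable[[str, int], Features] = toy_synthesize,
    seed: int = 0,
) -> List[Utterance]:
    """Combine real source-domain data with synthesized target-domain data."""
    rng = random.Random(seed)
    synthetic = []
    for text in target_texts:
        # Assumption: pick a speaker uniformly at random for each sentence.
        speaker_id = rng.randrange(num_speakers)
        synthetic.append(
            Utterance(synthesize(text, speaker_id), text.split(), True)
        )
    combined = list(real_source) + synthetic
    rng.shuffle(combined)  # mix real and synthesized utterances
    return combined


real = [Utterance([[0.1, 0.2], [0.3, 0.4]], ["hello", "world"], False)]
texts = ["open the door", "close the window"]
train_set = build_training_set(real, texts, num_speakers=4)
```

The resulting `train_set` would then be fed to whatever A2W training loop is in use; the target-domain text contributes word targets that the real source-domain data alone would not cover.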

Bibliographic record

Source
IEEE International Conference on Acoustics, Speech and Signal Processing | 2019 | pp. 6161-6165 | 5 pages
Authors
Sei Ueno; Masato Mimura; Shinsuke Sakai; Tatsuya Kawahara
Original format: PDF
Keywords
Training; Speech synthesis; Data models; Hidden Markov models; Decoding; Synthesizers

Similar literature

1. Data Augmentation Using Virtual Microphone Array Synthesis and Multi-Resolution Feature Extraction for Isolated Word Dysarthric Speech Recognition [J]. Celin T. A. Mariya, Nagarajan T., Vijayalakshmi P. IEEE Journal of Selected Topics in Signal Processing. 2020, No. 2
2. Advances in Multi-speaker Conversational Speech Recognition and Understanding [J]. Takaaki Hori, Shoko Araki, Tomohiro Nakatani. NTT Technical Review. 2013, No. 12
3. Acoustic data augmentation for Mandarin-English code-switching speech recognition [J]. Applied Acoustics. 2020, issue Apra
4. Multi-speaker Sequence-to-sequence Speech Synthesis for Data Augmentation in Acoustic-to-word Speech Recognition [C]. Sei Ueno, Masato Mimura, Shinsuke Sakai. IEEE International Conference on Acoustics, Speech and Signal Processing. 2019
5. Adaptation and Augmentation: Towards Better Rescoring Strategies for Automatic Speech Recognition and Spoken Term Detection [D]. Ma, Min. 2018
6. Incorporating Noise Robustness in Speech Command Recognition by Noise Augmentation of Training Data [O]. Ayesha Pervaiz, Fawad Hussain, Huma Israr. 2020
7. Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition [O]. Soltau, Hagen; Liao, Hank; Sak, Hasim. 2016
