IEEE International Conference on Acoustics, Speech and Signal Processing

TTS-by-TTS: TTS-Driven Data Augmentation for Fast and High-Quality Speech Synthesis


Abstract

In this paper, we propose a text-to-speech (TTS)-driven data augmentation method for improving the quality of a non-autoregressive (non-AR) TTS system. Recently proposed non-AR models, such as FastSpeech 2, have successfully achieved fast speech synthesis. However, their quality is not satisfactory, especially when the amount of training data is insufficient. To address this problem, we propose an effective data augmentation method using a well-designed autoregressive (AR) TTS system. In this method, large-scale synthetic corpora consisting of text-waveform pairs with phoneme durations are generated by the AR TTS system and then used to train the target non-AR model. Perceptual listening test results showed that the proposed method significantly improved the quality of the non-AR TTS system. In particular, we augmented a five-hour training database to 179 hours of synthetic data. Using these databases, our TTS system, consisting of a FastSpeech 2 acoustic model with a Parallel WaveGAN vocoder, achieved a mean opinion score of 3.74, which is 40% higher than that achieved by the conventional method.
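The data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Utterance` record, the `augment` helper, and the `toy_teacher` stand-in for the AR TTS system are all hypothetical names introduced here, and mixing recorded with synthetic utterances in one corpus is an assumption the abstract does not confirm. The last two lines only restate the abstract's numbers as arithmetic, reading "40% higher" as a relative increase.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Utterance:
    """One training example: text, per-phoneme durations, and audio samples."""
    text: str
    durations: List[int]    # frames per phoneme (hypothetical unit)
    waveform: List[float]   # audio samples
    synthetic: bool         # True if produced by the teacher system


# Hypothetical teacher interface: text -> (phoneme durations, waveform).
Teacher = Callable[[str], Tuple[List[int], List[float]]]


def augment(recorded: Sequence[Utterance],
            texts: Sequence[str],
            teacher: Teacher) -> List[Utterance]:
    """Label extra texts with the teacher and append them to the recorded data."""
    synthetic = []
    for text in texts:
        durations, waveform = teacher(text)
        synthetic.append(Utterance(text, durations, waveform, synthetic=True))
    return list(recorded) + synthetic


def toy_teacher(text: str) -> Tuple[List[int], List[float]]:
    """Placeholder for an AR TTS system; it emits silence, not real speech.

    Assumes one "phoneme" per non-space character, 2 frames per phoneme,
    and 4 samples per frame.
    """
    phonemes = [c for c in text if not c.isspace()]
    durations = [2] * len(phonemes)
    waveform = [0.0] * (sum(durations) * 4)
    return durations, waveform


recorded = [Utterance("hello", [2] * 5, [0.0] * 40, synthetic=False)]
corpus = augment(recorded, ["good morning", "thank you"], toy_teacher)

# Arithmetic on the figures quoted in the abstract.
augmentation_ratio = 179 / 5        # synthetic hours per recorded hour
implied_baseline_mos = 3.74 / 1.40  # baseline score if the gain is relative
```

In a real setup the resulting corpus would supply the text, duration, and waveform targets for training the non-AR acoustic model; the toy teacher only shows the shape of the data, not its content.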
