Venue: Spoken Language Technology Workshop

Multi-Band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech



Abstract

In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which has been shown to benefit speech generation. Second, we replace the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals, which are subsequently summed back into full-band signals as discriminator input. The proposed multi-band MelGAN achieves high MOS values of 4.34 and 4.22 in waveform generation and TTS, respectively. With only 1.91M parameters, our model reduces the total computational complexity of the original MelGAN from 5.85 to 0.95 GFLOPS. Our PyTorch implementation achieves a real-time factor of 0.03 on CPU without hardware-specific optimization.
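The abstract names the multi-resolution STFT loss but does not give its formula. The sketch below illustrates one commonly used formulation, assumed here rather than taken from the paper: for each STFT resolution, a spectral convergence term plus a log-magnitude L1 term, averaged over resolutions. The function names, the NumPy implementation (the paper's own code is in PyTorch), and the three (FFT size, hop, window length) settings are illustrative assumptions.

```python
import numpy as np


def stft_magnitude(x, fft_size, hop, win_len):
    """Magnitude STFT with a Hann window, shape (frames, fft_size // 2 + 1)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    # rfft zero-pads each frame from win_len up to fft_size.
    spec = np.fft.rfft(frames, n=fft_size, axis=1)
    # Floor the magnitude so the logarithm below is always defined.
    return np.maximum(np.abs(spec), 1e-7)


def stft_loss(real, fake, fft_size, hop, win_len):
    """Single-resolution loss: spectral convergence + log-magnitude L1."""
    s_real = stft_magnitude(real, fft_size, hop, win_len)
    s_fake = stft_magnitude(fake, fft_size, hop, win_len)
    # Frobenius norm of the magnitude difference, relative to the real signal.
    spectral_convergence = np.linalg.norm(s_real - s_fake) / np.linalg.norm(s_real)
    log_magnitude = np.mean(np.abs(np.log(s_real) - np.log(s_fake)))
    return spectral_convergence + log_magnitude


# Assumed (fft_size, hop, win_len) triples; not specified in the abstract.
RESOLUTIONS = ((1024, 120, 600), (2048, 240, 1200), (512, 50, 240))


def multi_resolution_stft_loss(real, fake, resolutions=RESOLUTIONS):
    """Average the single-resolution loss over all STFT resolutions."""
    return float(np.mean([stft_loss(real, fake, *r) for r in resolutions]))
```

Calling `multi_resolution_stft_loss(real, fake)` on two equal-length 1-D arrays (at least as long as the largest window) returns a single scalar. Using several resolutions trades time resolution against frequency resolution, so a generator cannot satisfy the loss by matching only one particular analysis setting.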
