

Multi-Band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech



Abstract

In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals, which are subsequently summed back to full-band signals as discriminator input. The proposed multi-band MelGAN achieves high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively. With only 1.91M parameters, our model effectively reduces the total computational complexity of the original MelGAN from 5.85 to 0.95 GFLOPS. Our PyTorch implementation can achieve a real-time factor of 0.03 on CPU without hardware-specific optimization.
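
To put the efficiency figures quoted in the abstract in perspective, the following minimal Python sketch works through the arithmetic. It assumes the usual definition of real-time factor (synthesis time divided by the duration of the generated audio); the 10-second utterance length and the helper name `real_time_factor` are illustrative assumptions, not taken from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Figures quoted in the abstract.
RTF_CPU = 0.03
MELGAN_GFLOPS = 5.85
MB_MELGAN_GFLOPS = 0.95

# Hypothetical utterance length, chosen only for illustration.
utterance_seconds = 10.0

# Time needed to synthesize the utterance at the reported RTF.
synthesis_seconds = RTF_CPU * utterance_seconds
# How many times faster than playback the synthesis runs.
speed_vs_real_time = 1.0 / RTF_CPU
# Ratio of original MelGAN complexity to multi-band MelGAN complexity.
complexity_reduction = MELGAN_GFLOPS / MB_MELGAN_GFLOPS

print(f"synthesis time for {utterance_seconds:.0f} s of audio: {synthesis_seconds:.2f} s")
print(f"speed relative to real time: {speed_vs_real_time:.1f}x")
print(f"complexity reduction: {complexity_reduction:.2f}x")
```

Under these assumptions, an RTF of 0.03 corresponds to synthesis roughly 33 times faster than playback, and the drop from 5.85 to 0.95 GFLOPS is about a 6.2-fold reduction.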


Bibliographic details

Source
Spoken Language Technology Workshop | 2021 | pp. 492-498 | 7 pages
Authors
Geng Yang; Shan Yang; Kai Liu; Peng Fang; Wei Chen; Lei Xie
Original format: PDF
Keywords
Training; Vocoders; Loss measurement; Generators; Stability analysis; Real-time systems; Speech processing
Date indexed: 2022-08-26 13:52:51

Similar documents


1. Fast Griffin Lim based waveform generation strategy for text-to-speech synthesis [J]. Ankit Sharma, Puneet Kumar, Vikas Maddukuri. Multimedia Tools and Applications, 2020, Issue 41a42.
2. Parameter Generation Methods With Rich Context Models for High-Quality and Flexible Text-To-Speech Synthesis [J]. IEEE Journal of Selected Topics in Signal Processing, 2014, Issue 2.
3. High-Quality Prosody Generation in Mandarin Text-to-Speech System [J]. Qing Guo, Jie Zhang, Nobuyuki Katae. Fujitsu Scientific & Technical Journal, 2010, Issue 1.
4. Waveform Generation for Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks [C]. Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi. IEEE International Conference on Acoustics, Speech and Signal Processing, 2019.
5. Fast, efficient generation of high-quality atomic charges [D]. Jakalian, Araz. 2000.
6. A Deep-Sequencing Workflow for the Fast and Efficient Generation of High-Quality African Swine Fever Virus Whole-Genome Sequences [O]. Jan H. Forth, Leonie F. Forth, Jacqueline King. 2019.
7. Fast Griffin Lim based waveform generation strategy for text-to-speech synthesis [O]. Ankit Sharma, Puneet Kumar, Vikas Maddukuri. 2020.
