Journal: Computer Speech and Language

Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra

Abstract

This paper proposes novel training algorithms for vocoder-free text-to-speech (TTS) synthesis based on generative adversarial networks (GANs) that compensate for short-time Fourier transform (STFT) amplitude spectra at low and multiple frequency resolutions. Vocoder-free TTS using STFT amplitude spectra can avoid the degradation of synthetic speech quality caused by the vocoder-based parameterization used in conventional TTS. Our previous work on vocoder-based TTS proposed a method for incorporating GAN-based distribution compensation into acoustic model training to improve synthetic speech quality. This paper extends that algorithm to vocoder-free TTS and proposes a GAN-based training algorithm using low-frequency-resolution amplitude spectra to overcome the difficulty of modeling the complicated distribution of high-dimensional spectra. In the proposed algorithm, amplitude spectra are transformed into low-frequency-resolution amplitude spectra by applying an average pooling function along the frequency axis; the GAN-based distribution compensation is then performed in the low-frequency-resolution domain. Because the low-frequency-resolution amplitude spectra approximately emulate filter banks, the proposed algorithm is expected to improve synthetic speech quality by reducing the differences between the spectral envelopes of natural and synthetic speech. Furthermore, various frequency scales related to human speech perception (e.g., the mel and inverse mel frequency scales) can be introduced into the proposed training algorithm by applying a frequency warping function to the amplitude spectra. This paper also proposes a GAN-based training algorithm using multi-frequency-resolution amplitude spectra, which uses both low- and original-frequency-resolution amplitude spectra to reduce the differences in not only spectral envelopes but also fine structures.
Experimental results demonstrate that (1) GANs using low-frequency-resolution amplitude spectra improve speech quality and are robust to the settings of the frequency resolution and hyperparameters, (2) among low-, original-, and multi-frequency-resolution amplitude spectra, the low-frequency-resolution spectra improve synthetic speech quality the most, and (3) using the inverse mel frequency scale to obtain the low-frequency-resolution amplitude spectra further improves synthetic speech quality. (C) 2019 The Authors. Published by Elsevier Ltd.
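
The average pooling step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name average_pool_frequency, the parameter pool_width, the use of non-overlapping windows, and the choice to drop trailing bins that do not fill a window are all assumptions made here for illustration. The abstract only states that average pooling is applied along the frequency axis.

```python
import numpy as np

def average_pool_frequency(amplitude, pool_width):
    """Average-pool an amplitude spectrogram along its frequency axis.

    amplitude:  array of shape (n_frames, n_bins) holding STFT amplitudes.
    pool_width: number of adjacent frequency bins averaged into one band
                (hypothetical name; non-overlapping windows are assumed).
    """
    amplitude = np.asarray(amplitude, dtype=float)
    n_frames, n_bins = amplitude.shape
    n_bands = n_bins // pool_width
    # Assumption: trailing bins that do not fill a whole window are dropped.
    trimmed = amplitude[:, :n_bands * pool_width]
    # Group bins into (n_bands, pool_width) windows and average each window.
    return trimmed.reshape(n_frames, n_bands, pool_width).mean(axis=2)

# Toy spectrogram: 2 frames, 8 frequency bins.
spec = np.array([[1.0, 3.0, 2.0, 2.0, 0.0, 4.0, 5.0, 5.0],
                 [2.0, 2.0, 4.0, 0.0, 1.0, 1.0, 6.0, 2.0]])
low_res = average_pool_frequency(spec, pool_width=2)
print(low_res.shape)  # (2, 4)
```

Each output band is the mean of pool_width neighbouring bins, so fine spectral structure is smoothed away while the coarse envelope is kept, which is consistent with the abstract's remark that the low-resolution spectra approximately emulate filter banks. Under the same assumptions, the multi-frequency-resolution variant would supply both spec and low_res to the GAN-based compensation.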
