IEEE/ACM Transactions on Audio, Speech, and Language Processing

A Neural Vocoder With Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis



Abstract

This article presents a neural vocoder named HiNet, which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra hierarchically. Unlike existing neural vocoders such as WaveNet, SampleRNN and WaveRNN, which directly generate waveform samples using a single neural network, the HiNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a simple DNN model that predicts log amplitude spectra (LAS) from acoustic features. The predicted LAS are sent into the PSP for phase recovery. Considering the issue of phase warping and the difficulty of phase modeling, the PSP is constructed by concatenating a neural source-filter (NSF) waveform generator with a phase extractor. We also introduce generative adversarial networks (GANs) into both the ASP and the PSP. Finally, the outputs of the ASP and PSP are combined to reconstruct speech waveforms by short-time Fourier synthesis. Since neither predictor contains autoregressive structures, the HiNet vocoder can generate speech waveforms with high efficiency. Objective and subjective experimental results show that our proposed HiNet vocoder achieves better naturalness of reconstructed speech than the conventional STRAIGHT vocoder, a 16-bit WaveNet vocoder built on an open-source implementation, and an NSF vocoder with complexity similar to the PSP, and obtains performance comparable to a 16-bit WaveRNN vocoder. We also find that the performance of HiNet is, to some extent, insensitive to the complexity of the neural waveform generator in the PSP. After simplifying its model structure, the time consumed for generating 1 s of 16 kHz speech waveform using a GPU can be further reduced from 0.34 s to 0.19 s without significant quality degradation.
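The final step the abstract describes, combining the ASP's log amplitude spectra with the PSP's recovered phase spectra via short-time Fourier synthesis, can be sketched as follows. This is a minimal illustration with NumPy, assuming a Hann window and illustrative frame settings; the function names and STFT configuration are not the authors' exact implementation.

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Windowed short-time Fourier analysis (illustrative settings)."""
    window = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def istft(spec, frame_len=512, hop=128):
    """Weighted overlap-add synthesis; divides by the summed squared
    window so interior samples are reconstructed exactly."""
    window = np.hanning(frame_len)
    n = (spec.shape[0] - 1) * hop + frame_len
    out = np.zeros(n)
    norm = np.zeros(n)
    for t, frame_spec in enumerate(spec):
        frame = np.fft.irfft(frame_spec, n=frame_len)
        out[t * hop:t * hop + frame_len] += frame * window
        norm[t * hop:t * hop + frame_len] += window ** 2
    norm[norm < 1e-8] = 1.0  # avoid division by zero at the edges
    return out / norm

def combine_and_synthesize(log_amp, phase, frame_len=512, hop=128):
    """Rebuild complex spectra S = exp(LAS) * exp(j*phase), then
    overlap-add back to a waveform, as in HiNet's final stage."""
    spec = np.exp(log_amp) * np.exp(1j * phase)
    return istft(spec, frame_len, hop)

# Round-trip sanity check on a synthetic 220 Hz tone at 16 kHz:
# extract LAS and phase from a real signal, then resynthesize.
x = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
S = stft(x)
y = combine_and_synthesize(np.log(np.abs(S) + 1e-12), np.angle(S))
```

In HiNet itself, `log_amp` would come from the ASP and `phase` from the phase extractor applied to the NSF generator's output; the round trip above merely checks that the amplitude/phase recombination and overlap-add synthesis are consistent.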
