Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing

GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis

Abstract

Recently, generative neural network models that operate directly on raw audio, such as WaveNet, have improved the state of the art in text-to-speech synthesis (TTS). Moreover, there is increasing interest in using these models as statistical vocoders for generating speech waveforms from various acoustic features. However, there is also a need to reduce model complexity without compromising the synthesis quality. Previously, glottal pulse-forms (i.e., time-domain waveforms corresponding to the source of the human voice production mechanism) have been successfully synthesized in TTS by glottal vocoders using straightforward deep feedforward neural networks. Therefore, it is natural to extend the glottal waveform modeling domain to use the more powerful WaveNet-like architecture. Furthermore, due to their inherent simplicity, glottal excitation waveforms permit scaling down the waveform generator architecture. In this study, we present a raw waveform glottal excitation model, called GlotNet, and compare its performance with the corresponding direct speech waveform model, WaveNet, using equivalent architectures. The models are evaluated as part of a statistical parametric TTS system. Listening test results show that both approaches are rated highly in voice similarity to the target speaker and obtain similar quality ratings with large models. Furthermore, when the model size is reduced, the quality degradation is less severe for GlotNet.
