Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing

GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis

Abstract

Recently, generative neural network models that operate directly on raw audio, such as WaveNet, have improved the state of the art in text-to-speech synthesis (TTS). Moreover, there is increasing interest in using these models as statistical vocoders for generating speech waveforms from various acoustic features. However, there is also a need to reduce model complexity without compromising synthesis quality. Previously, glottal pulse-forms (i.e., time-domain waveforms corresponding to the source of the human voice production mechanism) have been successfully synthesized in TTS by glottal vocoders using straightforward deep feedforward neural networks. Therefore, it is natural to extend glottal waveform modeling to the more powerful WaveNet-like architecture. Furthermore, due to their inherent simplicity, glottal excitation waveforms permit scaling down the waveform generator architecture. In this study, we present a raw waveform glottal excitation model, called GlotNet, and compare its performance with the corresponding direct speech waveform model, WaveNet, using equivalent architectures. The models are evaluated as part of a statistical parametric TTS system. Listening test results show that both approaches are rated highly in voice similarity to the target speaker and obtain similar quality ratings with large models. Furthermore, when the model size is reduced, the quality degradation is less severe for GlotNet.