
Multi-speaker Emotional Acoustic Modeling for CNN-based Speech Synthesis


Abstract

In this paper, we investigate multi-speaker emotional acoustic modeling methods for a convolutional neural network (CNN) based speech synthesis system. For emotion modeling, we extend the speech synthesis system to learn a latent embedding space of emotion derived from a desired emotional identity, using an emotion code and a mel-frequency spectrogram as the emotion identity. To model speaker variation in a text-to-speech (TTS) system, we use speaker representations such as a trainable speaker embedding and a speaker code. We implemented speech synthesis systems combining these speaker and emotion representations and compared them experimentally. The results demonstrate that the multi-speaker emotional speech synthesis approach using a trainable speaker embedding together with the emotion representation derived from the mel spectrogram outperforms the other approaches in naturalness, speaker similarity, and emotion similarity.
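The conditioning scheme described in the abstract, a trainable speaker embedding combined with an emotion embedding extracted from a reference mel spectrogram, can be illustrated with a minimal PyTorch sketch. All module names, layer sizes, and the time-pooling choice below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EmotionReferenceEncoder(nn.Module):
    """Maps a reference mel spectrogram to a fixed-size emotion embedding
    (a stand-in for the paper's mel-spectrogram emotion representation)."""
    def __init__(self, n_mels=80, emo_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, emo_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, mel):            # mel: (batch, n_mels, frames)
        h = self.conv(mel)             # (batch, emo_dim, frames)
        return h.mean(dim=2)           # average over time -> (batch, emo_dim)

class MultiSpeakerEmotionConditioner(nn.Module):
    """Combines a trainable speaker embedding with the emotion embedding and
    broadcasts the result along the CNN text encoder's time axis."""
    def __init__(self, n_speakers, spk_dim=64, emo_dim=64, n_mels=80):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, spk_dim)  # trainable lookup
        self.emotion_enc = EmotionReferenceEncoder(n_mels, emo_dim)

    def forward(self, text_hidden, speaker_id, ref_mel):
        # text_hidden: (batch, channels, steps) from the CNN text encoder
        spk = self.speaker_emb(speaker_id)           # (batch, spk_dim)
        emo = self.emotion_enc(ref_mel)              # (batch, emo_dim)
        cond = torch.cat([spk, emo], dim=1)          # (batch, spk_dim + emo_dim)
        cond = cond.unsqueeze(2).expand(-1, -1, text_hidden.size(2))
        return torch.cat([text_hidden, cond], dim=1) # channel-wise concatenation

# Usage sketch with dummy tensors for a 4-speaker setup.
cond_net = MultiSpeakerEmotionConditioner(n_speakers=4)
text_hidden = torch.randn(2, 256, 120)   # dummy encoder output
speaker_id = torch.tensor([0, 3])
ref_mel = torch.randn(2, 80, 200)        # dummy reference spectrogram
out = cond_net(text_hidden, speaker_id, ref_mel)
print(out.shape)                         # torch.Size([2, 384, 120])
```

The speaker-code and emotion-code variants compared in the paper would replace the lookup table or the reference encoder with fixed one-hot-style codes; the sketch shows only the best-performing combination reported in the abstract.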
