A Mood Speech Synthesis Model Based on the Variational Autoencoder

Abstract

Mood is an important carrier of emotional expression and plays a significant role in how a speaker conveys content. Current speech synthesis systems offer little support for mood, so synthesized speech tends to sound monotonous and flat. To address this and improve the naturalness of synthesized speech, we combine statistical parametric speech synthesis (SPSS) with a variational autoencoder (VAE) to learn a speaker's latent mood information in an unsupervised manner, and we add a classifier to improve the accuracy of mood learning. We propose a framework for mood speech synthesis with three parts: an acoustic model, a mood model, and a synthesis model. Given the target text and the target mood, the acoustic model and the mood model respectively reconstruct the acoustic features, including the fundamental frequency F0. These acoustic features are then fed to the WORLD vocoder to synthesize a speech signal carrying the target mood. The models are trained on the Blizzard Challenge 2018 corpus, and the experimental results show that the proposed model performs well at mood generation.
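The abstract does not state the training objective, so the following is only a minimal sketch of how a VAE with an auxiliary mood classifier is commonly trained, not the authors' implementation. It assumes a squared-error reconstruction term, a KL term against a standard normal prior, and a softmax cross-entropy term for the classifier. The function names and the weights `beta` and `gamma` are hypothetical and do not come from the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example; label is an integer class index."""
    top = max(logits)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[label]


def mood_vae_loss(x, x_hat, mu, log_var, logits, label, beta=1.0, gamma=1.0):
    """Assumed objective: reconstruction + beta * KL + gamma * classification loss."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))  # squared error (assumption)
    kl = kl_to_standard_normal(mu, log_var)
    ce = cross_entropy(logits, label)
    return recon + beta * kl + gamma * ce


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.2, -0.1], [-1.0, -0.5]
    z = reparameterize(mu, log_var, rng)  # latent mood code for one utterance
    loss = mood_vae_loss([1.0, 2.0], [0.9, 2.1], mu, log_var, [0.3, 1.2, -0.4], label=1)
    print(z, loss)
```

In this sketch the classifier term is what ties the otherwise unsupervised latent code to a mood label; how the paper actually weights or combines these terms is not given in the abstract.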

