IEEE International Conference on Acoustics, Speech and Signal Processing

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Abstract

Recent advances in neural TTS have made "human parity" synthesized speech possible when a large amount of studio-quality training data from a voice talent is available. However, with only limited, casual recordings from an ordinary speaker, human-like TTS remains a major challenge, and the output can also suffer from artifacts such as incomplete sentences and repeated words. Unlike languages written in the Roman alphabet, such as English, written Chinese has no spaces between adjacent words, so word segmentation errors can cause serious semantic confusion and unnatural prosody. In this study, we use a multi-speaker TTS model to compensate for the insufficient training data of a target speaker, and we investigate linguistic features and Bert-derived information to improve the prosody of our Mandarin Chinese TTS. Three factors are studied: (1) phone-related and prosody-related linguistic features; (2) better break prediction with a refined Bert-CRF model; and (3) a phoneme sequence augmented with character embeddings derived from a Bert model. Subjective tests on in-domain and out-of-domain tasks (News, Chat and Audiobook) show that all three factors are effective in improving the prosody of our Mandarin TTS. The model with additional character embeddings from Bert performs best, outperforming the baseline by 0.17 MOS.
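The abstract does not say how the character embeddings are combined with the phoneme sequence. One plausible reading, assumed here purely for illustration and not confirmed by the source, is that each character's embedding is repeated for every phoneme of that character and concatenated with the phoneme's own vector. The function name `augment_phonemes`, the toy vectors, and the per-character phoneme counts below are all hypothetical; in practice the character vectors would come from a Bert model rather than being written by hand.

```python
from typing import List, Sequence


def augment_phonemes(
    phoneme_vecs: Sequence[Sequence[float]],
    char_vecs: Sequence[Sequence[float]],
    phones_per_char: Sequence[int],
) -> List[List[float]]:
    """Concatenate each phoneme vector with the vector of its parent character.

    Assumption (not from the paper): alignment is given as the number of
    phonemes belonging to each character, in order.
    """
    if len(char_vecs) != len(phones_per_char):
        raise ValueError("need exactly one phoneme count per character")
    if sum(phones_per_char) != len(phoneme_vecs):
        raise ValueError("phoneme counts must cover the phoneme sequence exactly")

    augmented: List[List[float]] = []
    idx = 0
    for char_vec, count in zip(char_vecs, phones_per_char):
        # Repeat the character vector once per phoneme of this character.
        for _ in range(count):
            augmented.append(list(phoneme_vecs[idx]) + list(char_vec))
            idx += 1
    return augmented


# Toy input: two characters with two phonemes each (an initial and a final).
phoneme_vecs = [[0.1], [0.2], [0.3], [0.4]]   # stand-ins for phoneme embeddings
char_vecs = [[1.0, 1.5], [2.0, 2.5]]          # stand-ins for Bert character embeddings
result = augment_phonemes(phoneme_vecs, char_vecs, [2, 2])
```

Each output row has the phoneme dimensions followed by the character dimensions, so a downstream encoder sees both the local phonetic identity and character-level context at every phoneme position.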