IEEE International Conference on Acoustics, Speech and Signal Processing

Investigation of Fast and Efficient Methods for Multi-Speaker Modeling and Speaker Adaptation

Abstract

In this paper, we propose a novel method for fast and efficient few-shot TTS that disentangles linguistic and speaker representations. Specifically, an adversarial training strategy is first employed to remove speaker information from the linguistic representations. The speaker representations are then extracted from audio signals by a speaker encoder with a random sampling mechanism and a speaker classifier, with the aim of obtaining speaker embeddings that are independent of content information (such as prosody and style). In addition, for faster and more efficient adaptation, we introduce prior alignment knowledge between the text and audio pairs and propose multi-alignment guided attention to aid attention learning. Experimental results show that, when adapting to new speakers with 20 utterances, the proposed method not only generates speech with higher quality and speaker similarity, with average absolute MOS improvements of 0.26 and 0.30 respectively, but also converges much faster and more efficiently. Moreover, we achieve a MOS of 4.45 for a premium voice, outperforming a single-speaker model, which scores 4.23.
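
The abstract does not give the formulation of the multi-alignment guided attention, so the following is only a minimal sketch of the general idea of guiding attention with a prior text-audio alignment, not the authors' method. It assumes a single prior alignment given as per-token frame counts, a Gaussian-shaped penalty, and a hypothetical width parameter `g`; the function names are illustrative.

```python
import math


def prior_from_durations(durations):
    """Expand per-token frame counts into one expected token index per frame."""
    return [idx for idx, count in enumerate(durations) for _ in range(count)]


def guided_attention_loss(attention, prior, g=0.2):
    """Average attention mass, weighted by distance from the prior alignment.

    attention: one row per decoder frame, each row a list of weights over tokens.
    prior: expected token index for each decoder frame.
    g: width of the tolerated band around the prior (assumed value).
    """
    n_tokens = len(attention[0])
    total = 0.0
    for row, expected in zip(attention, prior):
        for n, weight in enumerate(row):
            dist = (n - expected) / n_tokens
            # Penalty is 0 on the prior alignment and approaches 1 far from it.
            total += weight * (1.0 - math.exp(-dist * dist / (2.0 * g * g)))
    return total / (len(attention) * n_tokens)


# Two tokens lasting 2 frames and 1 frame respectively (made-up toy values).
prior = prior_from_durations([2, 1])
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
print(guided_attention_loss(aligned, prior))
print(guided_attention_loss(misaligned, prior))
```

Adding such a term to the training loss penalizes attention that strays from the prior alignment, which is one plausible way that prior alignment knowledge could speed up attention learning during adaptation.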