European Conference on Speech Communication and Technology - EUROSPEECH 2003 (INTERSPEECH 2003), vol. 1; 1-4 September 2003; Geneva, Switzerland

My Voice, Your Prosody: Sharing a speaker specific prosody model across speakers in unit selection TTS



Abstract

Data sparsity is a major problem for data-driven prosodic models. Sharing prosodic data across speakers is a potential solution. This paper explores that solution by addressing two questions: 1) Does a larger, less sparse model from a different speaker produce more natural speech than a small, sparse model built from the original speaker's data? 2) Does a different speaker's larger model generate more unit selection errors than a small, sparse model built from the original speaker's data? A unit selection approach is used to produce a lazy learning model of three English RP speakers' f0 and durational parameters. Speaker 1 (the target speaker) had a much smaller database (approximately one quarter to one fifth the size of the other two). Speaker 2 was a female speaker with frequent mid-phrase rises. Speaker 3 was a male speaker with an f0 range similar to speaker 1's and a measured prosodic style suitable for news and financial text. We apply the models created for speaker 2 (an inappropriate model) and speaker 3 (an appropriate model) to speaker 1 and compare the results. Three passages (three to four sentences in length) from challenging prosodic genres (news report, poetry, and personal email) were synthesised using the target speaker's voice and each of the three models. The synthesised utterances were played to 15 native English subjects and rated on a 5-point MOS scale. In addition, 7 experienced speech engineers rated each word for errors on a three-point scale: 1. Acceptable, 2. Poor, 3. Unacceptable. The results suggest that a large model from an appropriate speaker sounds no more natural and produces no fewer errors than a smaller model built from the speaker's own data. They also show that an inappropriate model produces both less natural speech and more errors.
High variance in both the subject and materials analyses suggests that both tests are far from ideal and that evaluation techniques for both error rate and naturalness need to improve.
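The "lazy learning" model described in the abstract is essentially nearest-neighbour prediction of prosodic parameters from a speaker's database. A minimal sketch of that idea, with entirely hypothetical feature vectors and values (the paper's actual features and distance metric are not specified here):

```python
import math

def knn_predict(database, query, k=3):
    """Lazy-learning prosody prediction: average the f0 and duration of the
    k nearest neighbours of the query feature vector.
    database: list of (feature_vector, (f0_hz, dur_ms)) pairs."""
    nearest = sorted(database, key=lambda ex: math.dist(ex[0], query))[:k]
    f0 = sum(ex[1][0] for ex in nearest) / k
    dur = sum(ex[1][1] for ex in nearest) / k
    return f0, dur

# Hypothetical features: (relative position in phrase, stressed?, phrase length)
db = [
    ((0.1, 1.0, 5.0), (210.0, 90.0)),
    ((0.5, 0.0, 5.0), (180.0, 70.0)),
    ((0.9, 1.0, 5.0), (150.0, 120.0)),
    ((0.2, 1.0, 8.0), (205.0, 85.0)),
]

f0, dur = knn_predict(db, (0.15, 1.0, 5.0), k=2)
```

Sharing a model across speakers, as the paper investigates, amounts to keeping the target speaker's voice units while drawing the neighbours from a different (larger) speaker's prosodic database.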


