European Conference on Speech Communication and Technology - EUROSPEECH 2003 (INTERSPEECH 2003), vol. 1; 1-4 September 2003; Geneva, Switzerland

My Voice, Your Prosody: Sharing a speaker specific prosody model across speakers in unit selection TTS



Abstract

Data sparsity is a major problem for data-driven prosodic models. Sharing prosodic data across speakers is a potential solution. This paper explores that solution by addressing two questions: 1) Does a larger, less sparse model from a different speaker produce more natural speech than a small, sparse model built from the original speaker's data? 2) Does a different speaker's larger model generate more unit selection errors than a small, sparse model built from the original speaker's data? A unit selection approach is used to produce a lazy learning model of three English RP speakers' f0 and durational parameters. Speaker 1 (the target speaker) had a much smaller database (approximately one quarter to one fifth the size of the other two). Speaker 2 was a female speaker with frequent mid-phrase rises. Speaker 3 was a male speaker with an f0 range similar to speaker 1's and a measured prosodic style suitable for news and financial text. We apply the models created for speaker 2 (an inappropriate model) and speaker 3 (an appropriate model) to speaker 1 and compare the results. Three passages (three to four sentences in length) from challenging prosodic genres (news report, poetry, and personal email) were synthesised using the target speaker's voice and each of the three models. The synthesised utterances were played to 15 native English subjects and rated on a 5-point MOS scale. In addition, 7 experienced speech engineers rated each word for errors on a three-point scale: 1. Acceptable, 2. Poor, 3. Unacceptable. The results suggest that a large model from an appropriate speaker sounds no more natural and produces no fewer errors than a smaller model built from the speaker's own data. They also show that an inappropriate model produces both less natural speech and more errors.
High variance in both the subject and materials analyses suggests that both tests are far from ideal and that evaluation techniques for both error rate and naturalness need to improve.
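The "lazy learning" model described in the abstract is essentially nearest-neighbour prediction of prosodic parameters from a speaker's database. A minimal sketch of that idea, with entirely hypothetical feature vectors and values (the paper's actual features and distance metric are not specified here):

```python
import math

def knn_predict(database, query, k=3):
    """Lazy-learning prosody prediction: average the f0 and duration of the
    k nearest neighbours of the query feature vector.
    database: list of (feature_vector, (f0_hz, dur_ms)) pairs."""
    nearest = sorted(database, key=lambda ex: math.dist(ex[0], query))[:k]
    f0 = sum(ex[1][0] for ex in nearest) / k
    dur = sum(ex[1][1] for ex in nearest) / k
    return f0, dur

# Hypothetical features: (relative position in phrase, stressed?, phrase length)
db = [
    ((0.1, 1.0, 5.0), (210.0, 90.0)),
    ((0.5, 0.0, 5.0), (180.0, 70.0)),
    ((0.9, 1.0, 5.0), (150.0, 120.0)),
    ((0.2, 1.0, 8.0), (205.0, 85.0)),
]

f0, dur = knn_predict(db, (0.15, 1.0, 5.0), k=2)
```

Sharing a model across speakers, as the paper investigates, amounts to keeping the target speaker's voice units while drawing the neighbours from a different (larger) speaker's prosodic database.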


