IEICE Transactions on Information and Systems

Pre-Training of DNN-Based Speech Synthesis Based on Bidirectional Conversion between Text and Speech



Abstract

Conventional approaches to statistical parametric speech synthesis use context-dependent hidden Markov models (HMMs) clustered with decision trees to generate speech parameters from linguistic features. However, decision trees are not always able to model the complex context dependencies of linguistic features efficiently. An alternative scheme that replaces decision trees with deep neural networks (DNNs) was proposed as a possible way to overcome this difficulty. By training the network to represent high-dimensional feedforward dependencies from linguistic features to acoustic features, DNN-based speech synthesis systems convert text into speech. To improve the naturalness of the synthesized speech, this paper presents a novel pre-training method for DNN-based statistical parametric speech synthesis systems. In our method, a deep relational model (DRM), which represents a joint probability of two visible variables, is applied to describe the joint distribution of acoustic and linguistic features. Like a DNN, a DRM consists of several hidden layers and two visible layers. Whereas DNNs represent feedforward dependencies from one set of visible variables (inputs) to another (outputs), a DRM is able to represent the bidirectional dependencies between two sets of visible variables. During maximum-likelihood (ML) training, the model optimizes the parameters of its deep architecture (the connection weights between adjacent layers, and the biases) by considering the bidirectional conversion between 1) acoustic features given linguistic features, and 2) linguistic features given the acoustic features that the model itself generates. Because it takes into account whether the generated acoustic features are recognizable, our method can obtain reasonable parameters for speech synthesis.
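The bidirectional criterion described above can be caricatured with a toy objective that penalizes both the synthesis error and the error of recognizing the model's own output. This is a deliberately simplified sketch with hypothetical dimensions: the paper trains a single DRM by maximum likelihood, not two separate feedforward networks with a squared error.

```python
import math
import random

random.seed(0)

def make_layer(n_in, n_out):
    # Small random weights and zero biases -- the kind of parameters
    # that DRM-based pre-training would initialize in the real method.
    return ([[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

def forward(x, layer, activation=math.tanh):
    w, b = layer
    return [activation(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

# Toy dimensions (hypothetical; real linguistic/acoustic features are much larger).
ling_dim, hid_dim, ac_dim = 4, 8, 3

# Forward direction: linguistic features -> acoustic features (synthesis).
f_hidden, f_out = make_layer(ling_dim, hid_dim), make_layer(hid_dim, ac_dim)
# Backward direction: acoustic features -> linguistic features (recognition).
g_hidden, g_out = make_layer(ac_dim, hid_dim), make_layer(hid_dim, ling_dim)

def synthesize(ling):
    return forward(forward(ling, f_hidden), f_out)

def recognize(ac):
    return forward(forward(ac, g_hidden), g_out)

def bidirectional_loss(ling, ac):
    # 1) Acoustic features given linguistic features.
    ac_hat = synthesize(ling)
    forward_err = sum((a - ah) ** 2 for a, ah in zip(ac, ac_hat))
    # 2) Linguistic features given the acoustic features the model generated itself,
    #    i.e. checking that the synthesized speech is still "recognizable".
    ling_hat = recognize(ac_hat)
    backward_err = sum((l - lh) ** 2 for l, lh in zip(ling, ling_hat))
    return forward_err + backward_err

ling = [1.0, 0.0, 0.5, -0.5]   # toy one-frame linguistic feature vector
ac = [0.2, -0.1, 0.4]          # toy target acoustic feature vector
loss = bidirectional_loss(ling, ac)
print(round(loss, 4))
```

Minimizing such a combined objective over both directions is what allows the generated acoustic features to be checked for recognizability, which is the intuition behind the pre-training scheme.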
Experimental results on a speech synthesis task show that DNN-based systems pre-trained with our proposed method outperformed randomly initialized DNN-based systems, especially when the amount of training data is limited. Additionally, speaker-dependent speech recognition experiments, in which the initial parameters of our method were set to the same values as in the synthesis experiments, also show that our method outperformed DNN-based systems.


