IEICE Transactions on Information and Systems

Pre-Training of DNN-Based Speech Synthesis Based on Bidirectional Conversion between Text and Speech



Abstract

Conventional approaches to statistical parametric speech synthesis use context-dependent hidden Markov models (HMMs) clustered with decision trees to generate speech parameters from linguistic features. However, decision trees are not always able to model the complex context dependencies of linguistic features efficiently. An alternative scheme that replaces decision trees with deep neural networks (DNNs) was proposed as a possible way to overcome this difficulty. By training the network to represent high-dimensional feedforward dependencies from linguistic features to acoustic features, DNN-based speech synthesis systems convert text into speech. To improve the naturalness of the synthesized speech, this paper presents a novel pre-training method for DNN-based statistical parametric speech synthesis systems. In our method, a deep relational model (DRM), which represents a joint probability of two visible variables, is applied to describe the joint distribution of acoustic and linguistic features. Like a DNN, a DRM consists of several hidden layers and two visible layers. Whereas DNNs represent feedforward dependencies from one set of visible variables (inputs) to another (outputs), a DRM is able to represent the bidirectional dependencies between two sets of visible variables. During maximum-likelihood (ML) training, the model optimizes the parameters of its deep architecture (the connection weights between adjacent layers, and the biases) by considering the bidirectional conversion between 1) acoustic features given linguistic features, and 2) linguistic features given the acoustic features that the model itself generates. Because it takes into account whether the generated acoustic features are recognizable, our method can obtain reasonable parameters for speech synthesis.
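The bidirectional criterion described above can be caricatured with a toy objective that penalizes both the synthesis error and the error of recognizing the model's own output. This is a deliberately simplified sketch with hypothetical dimensions: the paper trains a single DRM by maximum likelihood, not two separate feedforward networks with a squared error.

```python
import math
import random

random.seed(0)

def make_layer(n_in, n_out):
    # Small random weights and zero biases -- the kind of parameters
    # that DRM-based pre-training would initialize in the real method.
    return ([[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

def forward(x, layer, activation=math.tanh):
    w, b = layer
    return [activation(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

# Toy dimensions (hypothetical; real linguistic/acoustic features are much larger).
ling_dim, hid_dim, ac_dim = 4, 8, 3

# Forward direction: linguistic features -> acoustic features (synthesis).
f_hidden, f_out = make_layer(ling_dim, hid_dim), make_layer(hid_dim, ac_dim)
# Backward direction: acoustic features -> linguistic features (recognition).
g_hidden, g_out = make_layer(ac_dim, hid_dim), make_layer(hid_dim, ling_dim)

def synthesize(ling):
    return forward(forward(ling, f_hidden), f_out)

def recognize(ac):
    return forward(forward(ac, g_hidden), g_out)

def bidirectional_loss(ling, ac):
    # 1) Acoustic features given linguistic features.
    ac_hat = synthesize(ling)
    forward_err = sum((a - ah) ** 2 for a, ah in zip(ac, ac_hat))
    # 2) Linguistic features given the acoustic features the model generated itself,
    #    i.e. checking that the synthesized speech is still "recognizable".
    ling_hat = recognize(ac_hat)
    backward_err = sum((l - lh) ** 2 for l, lh in zip(ling, ling_hat))
    return forward_err + backward_err

ling = [1.0, 0.0, 0.5, -0.5]   # toy one-frame linguistic feature vector
ac = [0.2, -0.1, 0.4]          # toy target acoustic feature vector
loss = bidirectional_loss(ling, ac)
print(round(loss, 4))
```

Minimizing such a combined objective over both directions is what allows the generated acoustic features to be checked for recognizability, which is the intuition behind the pre-training scheme.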
Experimental results on a speech synthesis task show that DNN-based systems pre-trained with our proposed method outperformed randomly initialized DNN-based systems, especially when the amount of training data is limited. Additionally, speaker-dependent speech recognition experiments, in which the initial parameters of our method were set to the same values as in the synthesis experiments, also show that our method outperformed DNN-based systems.


