Journal: ACM Transactions on Asian and Low-Resource Language Information Processing

Learning and Modeling Unit Embeddings Using Deep Neural Networks for Unit-Selection-Based Mandarin Speech Synthesis


Abstract

This article presents a method of learning and modeling unit embeddings using deep neural networks (DNNs) for unit-selection-based Mandarin speech synthesis. Here, a unit embedding is defined as a fixed-length embedding vector for a phone-sized unit candidate in a corpus. Modeling phone-sized embedding vectors instead of frame-sized acoustic features can better measure the long-term dependencies among consecutive units in an utterance. First, a DNN with an embedding layer is built to learn the embedding vectors of all unit candidates in the corpus from scratch. To enable the extracted embedding vectors to carry both acoustic and linguistic information about the unit candidates, a multitarget learning strategy is designed for this DNN. Its optional prediction targets include frame-level acoustic features, unit durations, monophone and tone identifiers, and context classes. Then, two further DNNs are constructed to map linguistic features to the extracted embedding vectors. One of them takes the unit vectors of the preceding phones, in addition to the linguistic features of the current phone, as its input. At synthesis time, the distances between the unit vectors predicted by these two DNNs and the ones derived from the unit candidates are used as a part of the target cost and a part of the concatenation cost, respectively. Our experiments on a Mandarin speech synthesis corpus demonstrate that learning and modeling unit embeddings improves the naturalness of hidden Markov model (HMM)-based unit selection speech synthesis. Furthermore, according to our subjective evaluation results, integrating multiple targets for learning unit embeddings achieves better performance than using only acoustic targets.
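
The way the predicted unit vectors enter the selection costs at synthesis time can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the distance measure or the weights, so the Euclidean distance, the weights `w_target` and `w_concat`, and all function and variable names below are assumptions made for illustration. The greedy choice for a single position is also a simplification; a full unit-selection system would typically search over whole sequences of candidates and combine these terms with its other cost components.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def embedding_costs(candidate_vec, pred_linguistic, pred_history,
                    w_target=1.0, w_concat=1.0):
    """Return (target_term, concat_term) for one candidate unit.

    candidate_vec   -- embedding learned for the candidate unit
    pred_linguistic -- vector predicted from linguistic features only
    pred_history    -- vector predicted from linguistic features plus the
                       unit vectors of the preceding phones
    The weights are hypothetical; the abstract gives no values.
    """
    target_term = w_target * euclidean(candidate_vec, pred_linguistic)
    concat_term = w_concat * euclidean(candidate_vec, pred_history)
    return target_term, concat_term


def best_candidate(candidates, pred_linguistic, pred_history):
    """Greedily pick the candidate id with the smallest summed embedding cost."""
    def total(item):
        _, vec = item
        return sum(embedding_costs(vec, pred_linguistic, pred_history))
    return min(candidates.items(), key=total)[0]


# Toy two-dimensional embeddings for two hypothetical candidates.
candidates = {"unit_a": [0.0, 0.0], "unit_b": [3.0, 4.0]}
chosen = best_candidate(candidates, [0.0, 0.0], [0.0, 0.0])
```

In this toy setup both predicted vectors coincide with the embedding of `unit_a`, so its two cost terms are zero and it is the candidate selected.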
