IEEE/ACM Transactions on Audio, Speech, and Language Processing

Non-Parallel Sequence-to-Sequence Voice Conversion With Disentangled Linguistic and Speaker Representations

Abstract

This article presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion is achieved by preserving the linguistic representations of source utterances while replacing the speaker representations with the target ones. Our model is built under the framework of encoder-decoder neural networks. A recognition encoder is designed to learn the disentangled linguistic representations with two strategies. First, phoneme transcriptions of the training data are introduced to provide references for learning linguistic representations of audio signals. Second, an adversarial training strategy is employed to further remove speaker information from the linguistic representations. Meanwhile, speaker representations are extracted from audio signals by a speaker encoder. The model parameters are estimated by two-stage training, including a pre-training stage using a multi-speaker dataset and a fine-tuning stage using the dataset of a specific conversion pair. Since both the recognition encoder and the decoder for recovering acoustic features are seq2seq neural networks, our proposed method imposes no constraints of frame alignment or frame-by-frame conversion. Experimental results showed that our method obtained higher similarity and naturalness than the best non-parallel voice conversion method in Voice Conversion Challenge 2018. Besides, the performance of our proposed method was close to that of the state-of-the-art parallel seq2seq voice conversion method.
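The conversion procedure the abstract describes can be sketched in miniature: a recognition encoder extracts a linguistic representation from the source utterance, a speaker encoder extracts an utterance-level embedding from the target speaker's audio, and a decoder recombines the two. The sketch below is purely illustrative, not the authors' implementation: all networks are replaced by fixed random projections, the dimensions are assumptions, and attention-based seq2seq alignment is stood in for by simple downsampling (so output length is decoupled from input frames, mirroring the paper's point that no frame-by-frame constraint applies).

```python
import numpy as np

# Hypothetical dimensions (assumptions, not from the paper).
D_ACOUSTIC, D_LING, D_SPK = 80, 64, 16

rng = np.random.default_rng(0)
W_rec = rng.standard_normal((D_ACOUSTIC, D_LING)) * 0.1          # recognition encoder
W_spk = rng.standard_normal((D_ACOUSTIC, D_SPK)) * 0.1           # speaker encoder
W_dec = rng.standard_normal((D_LING + D_SPK, D_ACOUSTIC)) * 0.1  # decoder

def recognition_encoder(acoustic):
    # (T, D_ACOUSTIC) -> (T', D_LING). Downsampling by 2 stands in for
    # seq2seq attention: the linguistic sequence length need not match
    # the acoustic frame count.
    return np.tanh(acoustic[::2] @ W_rec)

def speaker_encoder(acoustic):
    # (T, D_ACOUSTIC) -> (D_SPK,). Utterance-level embedding: pool over
    # time, then project. Adversarial training (not modeled here) would
    # push speaker information out of the linguistic branch and into this one.
    return np.tanh(acoustic.mean(axis=0) @ W_spk)

def decoder(linguistic, spk_embedding):
    # Tile the speaker embedding along time and decode acoustic features.
    spk = np.broadcast_to(spk_embedding, (linguistic.shape[0], D_SPK))
    return np.concatenate([linguistic, spk], axis=1) @ W_dec

def convert(source_utt, target_utt):
    """Keep the source's linguistic content, swap in the target's speaker."""
    ling = recognition_encoder(source_utt)
    spk = speaker_encoder(target_utt)
    return decoder(ling, spk)

src = rng.standard_normal((100, D_ACOUSTIC))   # source utterance, 100 frames
tgt = rng.standard_normal((120, D_ACOUSTIC))   # target-speaker utterance, 120 frames
converted = convert(src, tgt)
print(converted.shape)  # (50, 80)
```

Note that the source and target utterances have different lengths and need no alignment, which is the practical advantage of the non-parallel, seq2seq formulation over conventional frame-aligned conversion.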
