Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing

A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis


Abstract

Recurrent neural networks (RNNs) can predict the fundamental frequency ($F_0$) for statistical parametric speech synthesis systems, given linguistic features as input. However, these models assume conditional independence between consecutive $F_0$ values, given the RNN state. In a previous study, we proposed autoregressive (AR) neural $F_0$ models to capture the causal dependency of successive $F_0$ values. In subjective evaluations, a deep AR model (DAR) outperformed an RNN. Here, we propose a Vector Quantized Variational Autoencoder (VQ-VAE) neural $F_0$ model that is both more efficient and more interpretable than the DAR. This model has two stages: one uses the VQ-VAE framework to learn a latent code for the $F_0$ contour of each linguistic unit, and the other learns to map from linguistic features to latent codes. In contrast to the DAR and RNN, which process the input linguistic features frame by frame, the new model converts one linguistic feature vector into one latent code for each linguistic unit. The new model achieves better objective scores than the DAR, has a smaller memory footprint, and is computationally faster. Visualization of the latent codes for phones and moras reveals that each latent code represents an $F_0$ shape for a linguistic unit.
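At the core of the VQ-VAE stage, each linguistic unit's continuous latent vector is replaced by its nearest entry in a learned codebook, so one discrete code stands in for one $F_0$ shape. A minimal sketch of that nearest-neighbour quantization step (array shapes, the `quantize` function, and the toy data are illustrative, not taken from the paper):

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (N, D) encoder outputs, one per linguistic unit
    codebook: (K, D) learned latent code vectors
    Returns the chosen code indices and the quantized vectors z_q.
    """
    # Squared Euclidean distance from every z_e row to every codebook row
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # (N,) index of the nearest code per unit
    z_q = codebook[indices]          # (N, D) quantized latents fed to the decoder
    return indices, z_q

# Toy example: 4 linguistic units, 2-D latents, codebook of 3 codes
rng = np.random.default_rng(0)
codes, z_q = quantize(rng.normal(size=(4, 2)), rng.normal(size=(3, 2)))
```

In the full model, the second stage would then be trained to predict these discrete `codes` from the linguistic features, and the VQ-VAE decoder would reconstruct the $F_0$ contour from `z_q`.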

