Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing

A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis


Abstract

Recurrent neural networks (RNNs) can predict the fundamental frequency ($F_0$) for statistical parametric speech synthesis systems, given linguistic features as input. However, these models assume conditional independence between consecutive $F_0$ values, given the RNN state. In a previous study, we proposed autoregressive (AR) neural $F_0$ models to capture the causal dependency of successive $F_0$ values. In subjective evaluations, a deep AR model (DAR) outperformed an RNN. Here, we propose a Vector Quantized Variational Autoencoder (VQ-VAE) neural $F_0$ model that is both more efficient and more interpretable than the DAR. This model has two stages: one uses the VQ-VAE framework to learn a latent code for the $F_0$ contour of each linguistic unit, and the other learns to map from linguistic features to latent codes. In contrast to the DAR and RNN, which process the input linguistic features frame by frame, the new model converts one linguistic feature vector into one latent code for each linguistic unit. The new model achieves better objective scores than the DAR, has a smaller memory footprint, and is computationally faster. Visualization of the latent codes for phones and moras reveals that each latent code represents an $F_0$ shape for a linguistic unit.
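At the core of the VQ-VAE stage, each linguistic unit's continuous latent vector is replaced by its nearest entry in a learned codebook, so one discrete code stands in for one $F_0$ shape. A minimal sketch of that nearest-neighbour quantization step (array shapes, the `quantize` function, and the toy data are illustrative, not taken from the paper):

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (N, D) encoder outputs, one per linguistic unit
    codebook: (K, D) learned latent code vectors
    Returns the chosen code indices and the quantized vectors z_q.
    """
    # Squared Euclidean distance from every z_e row to every codebook row
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # (N,) index of the nearest code per unit
    z_q = codebook[indices]          # (N, D) quantized latents fed to the decoder
    return indices, z_q

# Toy example: 4 linguistic units, 2-D latents, codebook of 3 codes
rng = np.random.default_rng(0)
codes, z_q = quantize(rng.normal(size=(4, 2)), rng.normal(size=(3, 2)))
```

In the full model, the second stage would then be trained to predict these discrete `codes` from the linguistic features, and the VQ-VAE decoder would reconstruct the $F_0$ contour from `z_q`.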

