Conference: IEEE International Conference on Soft Computing and Machine Intelligence

Natural Text-to-Speech Synthesis by Conditioning Spectrogram Predictions from Transformer Network on WaveGlow Vocoder

Abstract

Text-to-speech (TTS) is a form of speech synthesis in which text is converted into human-like spoken output. State-of-the-art TTS methods employ neural-network-based approaches. This work examines some of the issues and limitations of current systems, specifically Tacotron-2, and attempts to improve its performance by modifying its architecture. The modified model uses a Transformer network as the Spectrogram Prediction Network (SPN) and WaveGlow as the Audio Generation Network (AGN). The modified model shows improvements in the speech output generated for the corresponding texts and in the inference time for audio generation, and it obtains a Mean Opinion Score (MOS) of 4.10 (out of 5).
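
For readers unfamiliar with the metric, a Mean Opinion Score is the arithmetic mean of listener ratings on a 1 to 5 scale. The sketch below illustrates that calculation only; the ratings are hypothetical placeholders, not data from the paper, and the function name `mean_opinion_score` and the normal-approximation confidence interval are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for listener ratings on a 1-5 scale.

    The interval uses a normal approximation (1.96 * standard error),
    which is an illustrative assumption, not the paper's stated method.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie in the range 1 to 5")
    mos = mean(ratings)
    # A single rating has no sample spread, so report a zero-width interval.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Hypothetical ratings from eight listeners (NOT taken from the paper).
example_ratings = [4, 5, 4, 3, 5, 4, 4, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In practice each synthesized utterance is rated by several listeners, and the ratings are pooled before averaging.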