IEEE International Conference on Acoustics, Speech and Signal Processing

A Neural Text-to-Speech Model Utilizing Broadcast Data Mixed with Background Music

Abstract

Recently, it has become easier to obtain speech data from various media such as the internet or YouTube, but directly utilizing such data to train a neural text-to-speech (TTS) model is difficult: the proportion of clean speech is insufficient, and the remainder includes background music, which hinders training even with a global style token (GST). Therefore, we propose the following method to successfully train an end-to-end TTS model with limited broadcast data. First, the background music is removed from the speech by introducing a music filter. Second, a GST-TTS model with an auxiliary quality classifier is trained on the filtered speech together with a small amount of clean speech. In particular, the quality classifier makes the embedding vector of the GST layer focus on representing the speech quality (filtered or clean) of the input speech. The experimental results verified that the proposed method synthesized speech of much higher quality than conventional methods.
