
Speech Synthesis for Text-Based Editing of Audio Narration


Abstract

Recorded audio narration plays a crucial role in many contexts, including online lectures, documentaries, demo videos, podcasts, and radio. However, editing audio narration with conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems let the editor perform select, cut, copy, and paste operations in the text transcript of the narration and apply the corresponding changes to the waveform. Such interfaces, however, do not support synthesizing new words that do not appear in the transcript. While it is possible to build a high-fidelity speech synthesizer from samples of a new voice, such synthesizers typically require a large amount of voice data as input, as well as substantial manual annotation, to perform well.

This thesis presents a speech synthesizer tailored for text-based editing of narrations. The basic idea is to synthesize the inserted word in a different voice using a standard pre-built speech synthesizer, and then transform that voice into the desired voice using voice conversion. Unfortunately, conventional voice conversion does not produce synthesis of sufficient quality for this application. Hence, this thesis introduces new voice conversion techniques that synthesize words with high individuality and clarity. Three methods are proposed. The first, CUTE, is a data-driven voice conversion method based on frame-level unit selection and exemplar features. The second, VoCo, builds on CUTE with several improvements that help the synthesized word blend more seamlessly into the context where it is inserted. Both CUTE and VoCo select sequences of audio frames from the voice samples and stitch them together to approximate the voice being converted. The third method improves on VoCo with deep neural networks. It involves two networks: FFTNet generates high-quality waveforms from acoustic features, and TimbreNet transforms the acoustic features of the generic synthesizer voice into those of a human voice.
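To make the frame-level unit selection idea behind CUTE and VoCo concrete, here is a minimal sketch: given the acoustic features of a word synthesized in a generic voice, a Viterbi search picks a sequence of frames from the target speaker's recordings that balances closeness to the target features against smooth transitions between consecutive frames. The function name, Euclidean distances, and cost weighting are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def select_frames(target, corpus, w_concat=1.0):
    """Illustrative frame-level unit selection via Viterbi search.

    target: (T, D) acoustic features of the word in the generic synthesizer voice
    corpus: (N, D) acoustic features of frames from the target speaker's samples
    Returns a list of T corpus-frame indices approximating the target sequence.
    """
    T, N = len(target), len(corpus)
    # Target cost: distance from each target frame to each corpus frame.
    tcost = np.linalg.norm(target[:, None, :] - corpus[None, :, :], axis=2)
    # Concatenation cost: penalize rough joins between stitched frames ...
    ccost = np.linalg.norm(corpus[:, None, :] - corpus[None, :, :], axis=2)
    # ... but frames that are contiguous in the recording join seamlessly.
    for i in range(N - 1):
        ccost[i, i + 1] = 0.0
    # Viterbi forward pass: dp[j] = cost of best path ending at corpus frame j.
    dp = tcost[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        total = dp[:, None] + w_concat * ccost   # (prev frame, next frame)
        back[t] = np.argmin(total, axis=0)
        dp = total[back[t], np.arange(N)] + tcost[t]
    # Backtrace the lowest-cost frame sequence.
    path = [int(np.argmin(dp))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In this sketch, when a run of target frames closely matches a contiguous stretch of the speaker's recordings, the zero-cost transitions make the search prefer copying that stretch whole, which is what lets concatenative conversion preserve the speaker's individuality.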

Bibliographic record

  • Author

    Jin, Zeyu

  • Affiliation

    Princeton University

  • Degree grantor: Princeton University
  • Subject: Computer science
  • Degree: Ph.D.
  • Year: 2018
  • Pages: 124 p.
  • Total pages: 124
  • Original format: PDF
  • Language: eng
