
Speech Synthesis for Text-Based Editing of Audio Narration


Abstract

Recorded audio narration plays a crucial role in many contexts, including online lectures, documentaries, demo videos, podcasts, and radio. However, editing audio narration with conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems let the editor perform select, cut, copy, and paste operations in the text transcript of the narration and apply the corresponding changes to the waveform. Such interfaces, however, do not support synthesizing new words that do not appear in the transcript. While it is possible to build a high-fidelity speech synthesizer from samples of a new voice, such synthesizers typically require a large amount of voice data as input, as well as substantial manual annotation, to perform well.

This thesis presents a speech synthesizer tailored for text-based editing of narrations. The basic idea is to synthesize the inserted word in a different voice using a standard pre-built speech synthesizer, and then transform that voice into the desired voice using voice conversion. Unfortunately, conventional voice conversion does not produce synthesis of sufficient quality for this application. Hence, this thesis introduces new voice conversion techniques that synthesize words with high individuality and clarity. Three methods are proposed. The first, CUTE, is a data-driven voice conversion method based on frame-level unit selection and exemplar features. The second, VoCo, builds on CUTE with several improvements that help the synthesized word blend more seamlessly into the context where it is inserted. Both CUTE and VoCo select sequences of audio frames from the voice samples and stitch them together to approximate the voice being converted. The third method improves on VoCo with deep neural networks. It involves two networks: FFTNet generates high-quality waveforms from acoustic features, and TimbreNet transforms the acoustic features of the generic synthesizer voice into those of a human voice.
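To make the frame-level unit selection idea behind CUTE and VoCo concrete, here is a minimal sketch: given the acoustic features of a word synthesized in a generic voice, a Viterbi search picks a sequence of frames from the target speaker's recordings that balances closeness to the target features against smooth transitions between consecutive frames. The function name, Euclidean distances, and cost weighting are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def select_frames(target, corpus, w_concat=1.0):
    """Illustrative frame-level unit selection via Viterbi search.

    target: (T, D) acoustic features of the word in the generic synthesizer voice
    corpus: (N, D) acoustic features of frames from the target speaker's samples
    Returns a list of T corpus-frame indices approximating the target sequence.
    """
    T, N = len(target), len(corpus)
    # Target cost: distance from each target frame to each corpus frame.
    tcost = np.linalg.norm(target[:, None, :] - corpus[None, :, :], axis=2)
    # Concatenation cost: penalize rough joins between stitched frames ...
    ccost = np.linalg.norm(corpus[:, None, :] - corpus[None, :, :], axis=2)
    # ... but frames that are contiguous in the recording join seamlessly.
    for i in range(N - 1):
        ccost[i, i + 1] = 0.0
    # Viterbi forward pass: dp[j] = cost of best path ending at corpus frame j.
    dp = tcost[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        total = dp[:, None] + w_concat * ccost   # (prev frame, next frame)
        back[t] = np.argmin(total, axis=0)
        dp = total[back[t], np.arange(N)] + tcost[t]
    # Backtrace the lowest-cost frame sequence.
    path = [int(np.argmin(dp))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In this sketch, when a run of target frames closely matches a contiguous stretch of the speaker's recordings, the zero-cost transitions make the search prefer copying that stretch whole, which is what lets concatenative conversion preserve the speaker's individuality.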

Bibliographic record

  • Author

    Jin, Zeyu

  • Affiliation

    Princeton University

  • Degree grantor: Princeton University
  • Subject: Computer science
  • Degree: Ph.D.
  • Year: 2018
  • Pages: 124 p.
  • Total pages: 124
  • Original format: PDF
  • Language: eng
