
Corpus-based unit selection for natural-sounding speech synthesis



Abstract

Speech synthesis is an automatic encoding process, carried out by machine, through which symbols conveying linguistic information are converted into an acoustic waveform. Over the past decade or so, a trend toward a non-parametric, corpus-based approach has focused on using real human speech as source material for producing novel, natural-sounding speech. This work proposes a communication-theoretic formulation in which unit selection is a noisy channel: an input sequence of symbols passes through it, and an output sequence emerges, possibly corrupted by the coverage limits of the corpus. The penalty of approximation is quantified by substitution and concatenation costs, which grade which unit contexts are interchangeable and where concatenations are not perceivable. These costs are semi-automatically derived from data and are found to agree with acoustic-phonetic knowledge. The implementation is based on a finite-state transducer (FST) representation that has been used successfully in speech and language processing applications, including speech recognition. A proposed constraint kernel topology connects all units in the corpus with their associated substitution and concatenation costs and enables an efficient Viterbi search that operates with low latency and scales to large corpora. An A* search can be applied in a second, rescoring pass to incorporate finer acoustic modelling. Extensions to this FST-based search include hierarchical and paralinguistic modelling. The search can also be used in an iterative feedback loop to record new utterances that enhance corpus coverage. This speech synthesis framework has been deployed across various domains and languages in many voices, a testament to its flexibility and rapid prototyping capability.
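
To make the role of the two costs concrete, here is a minimal Python sketch of a Viterbi search over candidate units. It is not the FST-based constraint kernel described in the abstract: it is plain dynamic programming over a toy corpus, and every name in it (`Unit`, `build_corpus`, `substitution_cost`, `concatenation_cost`, `select_units`) as well as the cost values are assumptions made for illustration. The thesis derives its costs semi-automatically from data; here a substitution costs 1 per mismatched context side, and a concatenation costs 2 unless the two units are contiguous in the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    utt: int     # index of the corpus utterance the unit comes from
    pos: int     # position of the unit inside that utterance
    phone: str   # unit label
    left: str    # label of the left neighbour ("#" at an utterance edge)
    right: str   # label of the right neighbour ("#" at an utterance edge)


def build_corpus(utterances):
    """Turn lists of phone labels into context-annotated units."""
    units = []
    for u, phones in enumerate(utterances):
        padded = ["#"] + list(phones) + ["#"]
        for i, p in enumerate(phones):
            units.append(Unit(u, i, p, padded[i], padded[i + 2]))
    return units


def substitution_cost(unit, left, right):
    # Toy assumption: 1 for each context side that differs from the target.
    return (unit.left != left) + (unit.right != right)


def concatenation_cost(a, b):
    # Toy assumption: joining units that were contiguous in the corpus is free.
    contiguous = a.utt == b.utt and b.pos == a.pos + 1
    return 0.0 if contiguous else 2.0


def select_units(target, corpus):
    """Viterbi search: return (total cost, unit path) for the target labels."""
    padded = ["#"] + list(target) + ["#"]
    prev = {}  # candidate unit -> (best cost so far, best path so far)
    for i, phone in enumerate(target):
        candidates = [u for u in corpus if u.phone == phone]
        if not candidates:
            raise ValueError(f"no unit in corpus for {phone!r}")
        cur = {}
        for c in candidates:
            sub = substitution_cost(c, padded[i], padded[i + 2])
            if i == 0:
                cur[c] = (sub, [c])
                continue
            # Best predecessor, accounting for the join between p and c.
            best_cost, best_path = min(
                ((pc + concatenation_cost(p, c), path)
                 for p, (pc, path) in prev.items()),
                key=lambda t: t[0],
            )
            cur[c] = (best_cost + sub, best_path + [c])
        prev = cur
    return min(prev.values(), key=lambda t: t[0])
```

Calling `select_units(["k", "ae", "b"], build_corpus([["k", "ae", "t"], ["t", "ae", "b"]]))` returns a total cost together with the chosen units; a target that already exists verbatim in the corpus is recovered with zero cost, since no context is substituted and no non-contiguous join is needed. The candidate loop here is quadratic per position, so it only illustrates the cost structure and says nothing about the latency or scaling the thesis reports for its own search.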

Bibliographic record

  • Author

    Yi Jon Rong-Wei 1975-;

  • Author affiliation
  • Year 2003
  • Total pages
  • Original format PDF
  • Language of text eng
  • Chinese Library Classification
