Text-To-Speech à base de HMM (Hidden Markov Model) pour le vietnamien : modélisation de la segmentation prosodique, la conception du corpus, la conception du système, et l’évaluation perceptive

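English translation: HMM (Hidden Markov Model)-based Text-To-Speech for Vietnamese: prosodic phrasing modeling, corpus design, system design, and perceptual evaluation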

Abstract

The objective of this thesis is to design and build a high-quality Hidden Markov Model (HMM)-based Text-To-Speech (TTS) system for Vietnamese, a tonal language. The system is called VTED (Vietnamese TExt-to-speech Development system). Given the importance of lexical tones, a "tonophone" (an allophone in tonal context) is proposed as a new speech unit for the TTS system. A new training corpus, VDTS (Vietnamese Di-Tonophone Speech corpus), was designed for 100% coverage of di-phones in tonal contexts (i.e. di-tonophones), with sentences selected from a very large raw text by a greedy algorithm. About 4,000 VDTS sentences were recorded and pre-processed as the training corpus for VTED.

In HMM-based speech synthesis, although pause duration can be modeled as a phoneme, the appearance of pauses cannot be predicted by HMMs. Lower phrasing levels above words may not be completely modeled with basic features. This research aimed at automatic prosodic phrasing for Vietnamese TTS using durational clues alone, as it appeared too difficult to disentangle intonation from lexical tones. Syntactic blocks, i.e. syntactic phrases with a bounded number of syllables (n), were proposed for predicting final lengthening (n = 6) and pause appearance (n = 10). Final lengthening prediction was improved by several strategies for grouping single syntactic blocks. The predictive J48 decision-tree model for pause appearance, using syntactic blocks combined with syntactic link and POS (Part-Of-Speech) features, reached an F-score of 81.4% (Precision=87.6%, Recall=75.9%), much better than the model with only POS (F-score=43.6%) or only syntactic link (F-score=52.6%).

The system architecture was built on the core architecture of HTS, extended with a Natural Language Processing part for Vietnamese. Pause appearance was predicted by the proposed model. The contextual feature set included phone identity features, locational features, tone-related features, and prosodic features (i.e. POS, final lengthening, break levels). Mary TTS was chosen as the platform for implementing VTED.

In the MOS (Mean Opinion Score) test, the first VTED, trained with the old corpus and basic features, was rather good: 0.81 points (on a 5-point MOS scale) higher than the previous system, HoaSung (which uses non-uniform unit selection with the same training corpus), but still 1.2-1.5 points lower than natural speech. The final VTED, trained with the new corpus and the prosodic phrasing model, improved by about 1.04 points over the first VTED, and its gap with natural speech was much reduced. In the tone intelligibility test, the final VTED achieved a high correct rate of 95.4%, only 2.6% lower than natural speech and 18% higher than the initial one. The error rate of the first VTED in the intelligibility test with the Latin square design was about 6-12% higher than that of natural speech, depending on the syllable, tone or phone level. The final one differed from natural speech by only about 0.4-1.4%.
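The abstract says only that the VDTS sentences were chosen with "the greedy algorithm" to reach full di-tonophone coverage; it does not give the selection criterion. As an illustration, the following is a minimal sketch of a standard greedy set-cover heuristic, which may differ from the procedure actually used in the thesis. The function names, the toy sentence pool, and the use of adjacent token pairs as stand-ins for di-tonophones are all assumptions made for this example.

```python
def greedy_select(sentences, units_of):
    """Pick sentences until every unit found in the pool is covered.

    sentences: list of candidate sentences (toy data in this sketch).
    units_of:  function mapping a sentence to the set of units it contains.
    """
    unit_sets = [set(units_of(s)) for s in sentences]
    # Target: every unit that occurs anywhere in the candidate pool.
    uncovered = set().union(*unit_sets)
    chosen = []
    while uncovered:
        # Greedy step: take the sentence covering the most uncovered units.
        best = max(range(len(sentences)),
                   key=lambda i: len(unit_sets[i] & uncovered))
        chosen.append(sentences[best])
        uncovered -= unit_sets[best]
    return chosen


def toy_units(sentence):
    # Assumption: adjacent token pairs stand in for di-tonophones.
    tokens = sentence.split()
    return set(zip(tokens, tokens[1:]))


# Hypothetical pool: tokens such as "a1" mimic a phone plus a tone label.
pool = ["a1 b2 c3 d1", "a1 b2", "c3 d1 e2", "d1 e2"]
selected = greedy_select(pool, toy_units)
print(selected)
```

On this toy pool the first and third sentences are enough to cover every adjacent pair, so the other two are never selected. A real corpus design would also need a phonetizer to derive the units and would probably weigh sentence length or recording cost, which this sketch ignores.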

Bibliographic record

  • Author

    Nguyen Thi Thu Trang

  • Affiliation
  • Year: 2015
  • Total pages
  • Original format: PDF
  • Language of text: en
  • Chinese Library Classification
