In corpus-based speech synthesis, the quality of the synthetic speech depends critically on the speech corpus. Because the high vowels of Japanese may be devoiced in natural speech, devoiced vowels should be detected and transcribed automatically during corpus construction. In this paper, we apply an HMM-based method and use two kinds of likelihood difference as voicing measures, each with a different focus. To improve detection performance, discriminative training is applied to the voiced/devoiced HMMs. Moreover, features that can discriminate voiced from devoiced units, including duration, energy and autocorrelation, are combined with the likelihood differences in several ways. The experiments show different results for each high vowel, i.e. devoicing is vowel dependent. For the vowel /i/, discriminative training improves detection performance to a certain degree, and combining the voicing features and the likelihood differences with optimized weights improves detection accuracy further. For the vowel /u/, however, the improvement is very limited, even with the voicing features.
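The weighted combination of likelihood difference and voicing features mentioned above could look roughly like the following sketch. The abstract does not give the exact form of the combination, so the linear score, the zero threshold, the sign convention, and every name here (`VowelSegment`, `voicing_score`, `is_devoiced`, the example weights) are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VowelSegment:
    """Measurements for one high-vowel segment (hypothetical inputs)."""
    loglik_voiced: float    # log-likelihood under the voiced HMM
    loglik_devoiced: float  # log-likelihood under the devoiced HMM
    duration: float         # segment duration in seconds
    energy: float           # mean log energy of the segment
    autocorr: float         # peak normalized autocorrelation, in [0, 1]


def features(seg: VowelSegment) -> List[float]:
    # The first entry is the likelihood difference; by the convention
    # assumed here, positive values favor the voiced hypothesis.
    return [
        seg.loglik_voiced - seg.loglik_devoiced,
        seg.duration,
        seg.energy,
        seg.autocorr,
    ]


def voicing_score(seg: VowelSegment, weights: Sequence[float],
                  bias: float = 0.0) -> float:
    # Assumed linear combination: bias + sum_i weights[i] * feature_i.
    return bias + sum(w * x for w, x in zip(weights, features(seg)))


def is_devoiced(seg: VowelSegment, weights: Sequence[float],
                bias: float = 0.0) -> bool:
    # A segment is labeled devoiced when the combined score is negative.
    return voicing_score(seg, weights, bias) < 0.0


# Usage with made-up numbers: the likelihood difference alone says
# "devoiced", but a strong autocorrelation peak flips the decision.
seg = VowelSegment(loglik_voiced=-100.0, loglik_devoiced=-90.0,
                   duration=0.05, energy=-3.0, autocorr=0.9)
likelihood_only = [1.0, 0.0, 0.0, 0.0]
with_autocorr = [1.0, 0.0, 0.0, 20.0]
print(is_devoiced(seg, likelihood_only))
print(is_devoiced(seg, with_autocorr))
```

In practice the weights would be tuned on labeled data, separately for /i/ and /u/ given the vowel-dependent results reported above; the tuning procedure is not described in the abstract and is left out of this sketch.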