Dynamic Adaptation of Time-Frequency Resolution in Spectral Analysis of Speech Signals

Abstract

In speech parametrization, the speech signal spectrum is calculated over a frame with a typical length of 15 to 35 ms. This results in a uniform time-frequency resolution that does not conform well to the properties of human hearing. One of these properties is nonlinear frequency resolution, which can be approximated with multiresolution spectral analysis. In this thesis, the continuous wavelet transform was tried as a possible approach to extracting MFCCs, currently the most common type of parametrization. The resulting spectrum, like human hearing, has a frequency resolution that is finer at lower frequencies and becomes coarser with increasing frequency. This gives better time resolution at higher frequencies, which allows spectral changes to be detected more precisely if they are (also) present at higher frequencies. At the same time, the lower frequency band can be analyzed with fine frequency resolution. A comparison of the success rates achieved with the same speech recognition system did not show any advantage of wavelet-transform-based MFCCs over standard MFCCs. Because the continuous wavelet transform is computationally quite intensive, MFCCs based on the discrete wavelet transform were also tested. The use of the discrete wavelet transform resulted in a significantly decreased success rate. The wavelet transform does not solve the problems related to the nonstationarity of the speech signal, because the time-frequency resolution of its spectrum does not depend on time.

In this dissertation, three approaches to dynamically adapting the time-frequency resolution are presented. In every approach, one has to estimate how rapidly the spectrum is changing at a given time. This estimate can be based on known facts about the structure of speech or about speech production. In the first approach, adaptive time-frequency resolution was achieved by varying the frame length based on the phonetic structure of the speech.
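To make the trade-off behind varying the frame length concrete, the following minimal Python sketch computes how many samples a frame contains and how finely its spectrum resolves frequency. It relies only on the standard relation that an N-sample frame sampled at rate fs has DFT bins spaced fs / N apart. The 16 kHz sampling rate and the function name `frame_resolution` are assumptions made for illustration; they do not come from the thesis.

```python
def frame_resolution(frame_ms, sample_rate_hz=16000):
    """Return (samples per frame, DFT bin spacing in Hz) for one analysis frame.

    sample_rate_hz=16000 is an assumed value, not taken from the thesis.
    """
    n_samples = round(frame_ms * sample_rate_hz / 1000)
    # Bin spacing of an N-point DFT is fs / N, i.e. 1 / (frame duration).
    delta_f_hz = sample_rate_hz / n_samples
    return n_samples, delta_f_hz


if __name__ == "__main__":
    # Shorter frames localize events better in time but blur frequency.
    for frame_ms in (15, 25, 35):
        n, df = frame_resolution(frame_ms)
        print(f"{frame_ms} ms frame: {n} samples, bins {df:.1f} Hz apart")
```

Under these assumptions, a 15 ms frame resolves frequency more than twice as coarsely as a 35 ms frame, and any fixed choice applies that same compromise to every part of the signal.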
For each phoneme, the basic properties of its spectrum are known. The spectrum of vowels and some other long phonemes is almost stationary, whereas the spectrum of other phonemes, such as stops, changes rapidly. If the phonetic structure of the speech is known, the time-frequency resolution can be adapted by using an appropriate frame length for each phoneme. In speech recognition, the phonetic structure is not known in advance, so recognition has to be done in two passes. In the first pass, the phonetic structure is unknown and a fixed frame length is used for parametrization. In the second pass, the phonetic structure obtained in the first pass is known, and the frame length is selected on that basis. In the second approach, the time-frequency resolution was adapted according to Moore's formula, which describes human perception of intensity changes in the speech signal. Most intensity changes occur in sections of speech where temporal resolution is more important than frequency resolution. Larger intensity changes are related to short phonemes, such as the burst release in plosives. Intensity changes are also related to phoneme transitions. Therefore, when intensity changes are large, the wideband spectrum is emphasized, and when they are small, the narrowband spectrum is emphasized. Computing intensity changes is far less computationally intensive than determining the phonetic structure in an additional pass. The third approach is based on recognizing voiced and unvoiced speech segments. When voiced speech is produced, the vocal folds need to be closed to obstruct the airflow. Because voiced and unvoiced segments are determined by the opening or closing of the vocal folds, a voiced segment cannot be very short. Most voiced phonemes are long and have an almost stationary spectrum. In feature extraction, a longer frame was used on voiced segments and a shorter frame on unvoiced segments.
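The third approach (a longer frame on voiced segments, a shorter one on unvoiced segments) can be illustrated with a minimal Python sketch. The abstract does not say how voicing was detected or which frame lengths were used, so every specific here is an assumption: the energy and zero-crossing-rate heuristic in `is_voiced`, its thresholds, the 30 ms and 10 ms frame lengths, the fixed 10 ms hop, and the 16 kHz sampling rate. It is a sketch of the general idea, not the thesis's implementation.

```python
import numpy as np


def is_voiced(segment, zcr_threshold=0.15, energy_threshold=1e-4):
    """Crude voicing decision (assumed heuristic, not the thesis's detector).

    Voiced speech tends to have high energy and few zero crossings.
    """
    energy = np.mean(segment ** 2)
    # Fraction of adjacent sample pairs whose signs differ.
    zcr = np.mean(np.signbit(segment[1:]) != np.signbit(segment[:-1]))
    return bool(energy > energy_threshold and zcr < zcr_threshold)


def adaptive_frames(signal, sample_rate_hz=16000, long_ms=30, short_ms=10):
    """Return a list of (start_sample, frame_length, voiced) tuples.

    Frames start at a fixed hop of short_ms; each frame is long_ms long if
    its first short_ms of samples look voiced, otherwise short_ms long.
    All numeric defaults are illustrative assumptions.
    """
    long_n = sample_rate_hz * long_ms // 1000
    short_n = sample_rate_hz * short_ms // 1000
    frames = []
    start = 0
    while start + long_n <= len(signal):
        voiced = is_voiced(signal[start:start + short_n])
        frames.append((start, long_n if voiced else short_n, voiced))
        start += short_n
    return frames


if __name__ == "__main__":
    # Synthetic input: 0.5 s of a 150 Hz tone, then 0.5 s of white noise.
    t = np.arange(8000) / 16000
    tone = 0.5 * np.sin(2 * np.pi * 150 * t)
    noise = 0.1 * np.random.default_rng(0).standard_normal(8000)
    for start, length, voiced in adaptive_frames(np.concatenate([tone, noise]))[::20]:
        print(start, length, "voiced" if voiced else "unvoiced")
```

Keeping the hop fixed while varying only the frame length is one possible design choice; it yields the same number of feature vectors per second regardless of the voicing decision.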
All three approaches to dynamic adaptation of time-frequency resolution were tested with the same speech recognition system on two speech databases. Several additive and two convolutive distortions were used to test robustness. In our experiments, adapting the frame length based on the phonetic structure of speech proved to be too complicated. It is computationally demanding and was tested only with the smaller speech database. The success rate was almost unchanged, and robustness decreased slightly in comparison with the original speech recognition system, which uses standard MFCCs. Adapting the time-frequency resolution to intensity changes increased both the success rate and robustness. The improvement was quite large and very consistent. Adapting the frame length according to voiced and unvoiced speech segments improved robustness and, in some experiments, the success rate.

Bibliographic details

  • Author

    Štrancar Andrej

  • Author affiliation
  • Year: 2006
  • Total pages
  • Original format: PDF
  • Language of text: Slovene
  • Chinese Library Classification
