Journal: Journal of signal processing systems for signal, image, and video technology

Generalized I-vector Representation with Phonetic Tokenizations and Tandem Features for both Text Independent and Text Dependent Speaker Verification




This paper presents a generalized i-vector representation framework with phonetic tokenization and tandem features for both text-independent and text-dependent speaker verification. In the conventional i-vector framework, the tokens used to calculate the zero-order and first-order Baum-Welch statistics are Gaussian Mixture Model (GMM) components trained on acoustic-level MFCC features. Beyond MFCC, we believe that phonetic information offers another direction that can benefit system performance. Our contribution lies in integrating phonetic information into the i-vector representation through several extensions, forming a more generalized i-vector framework. First, the tokens used to calculate the zero-order statistics are extended from MFCC-trained GMM components to phonemes, trigrams, and tandem-feature-trained GMM components, using phoneme posterior probabilities. Second, given the zero-order statistics (posterior probabilities on tokens), the feature used to calculate the first-order statistics is also extended from MFCC to tandem features, and it need not be the same feature employed by the tokenizer. Third, the zero-order and first-order statistics vectors are concatenated and represented by the simplified supervised i-vector approach, followed by the standard Probabilistic Linear Discriminant Analysis (PLDA) back-end. We study different token and feature combinations and show that feature-level fusion of acoustic-level MFCC features and phonetic-level tandem features with a GMM-based i-vector representation achieves the best performance for text-independent speaker verification. Furthermore, we demonstrate that the phonetic-level phoneme constraints introduced by the tandem features help the text-dependent speaker verification system reject wrong-password trials and improve performance dramatically.
Experimental results are reported on the female part of the NIST SRE 2010 common condition 5 task and the female part of the RSR 2015 part 1 task for text-independent and text-dependent speaker verification, respectively. For the text-independent task, the proposed generalized i-vector representation outperforms the i-vector baseline by 53 % relative in terms of equal error rate (EER) and norm minDCF values. For the text-dependent task, the proposed approach also reduces the EER significantly, with relative reductions ranging from 23 % to 90 % across different types of trials.
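
The decoupling of tokenizer and feature described in the abstract can be illustrated with a minimal sketch. It assumes the textbook definitions of the Baum-Welch statistics, N_c = sum_t gamma_t(c) and F_c = sum_t gamma_t(c) x_t, without mean centering; the paper's exact formulation may differ. The function names `baum_welch_stats` and `concatenate_stats` and the toy numbers are hypothetical and are not taken from the paper. The per-frame posteriors may come from any tokenizer (GMM components, phonemes, trigrams), while the feature vectors may be a different stream (MFCC, tandem, or both).

```python
from typing import List, Sequence, Tuple


def baum_welch_stats(
    posteriors: Sequence[Sequence[float]],
    features: Sequence[Sequence[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Accumulate zero-order and (uncentered) first-order statistics.

    posteriors: T frames x C tokens, per-frame token posteriors from the tokenizer.
    features:   T frames x D dims, possibly a different feature than the tokenizer used.
    """
    if len(posteriors) != len(features):
        raise ValueError("posteriors and features must cover the same frames")
    if not posteriors:
        raise ValueError("at least one frame is required")

    num_tokens = len(posteriors[0])
    dim = len(features[0])
    zero = [0.0] * num_tokens                          # N_c
    first = [[0.0] * dim for _ in range(num_tokens)]   # F_c

    for gamma_t, x_t in zip(posteriors, features):
        for c, gamma in enumerate(gamma_t):
            zero[c] += gamma
            for d, x in enumerate(x_t):
                first[c][d] += gamma * x
    return zero, first


def concatenate_stats(zero: List[float], first: List[List[float]]) -> List[float]:
    """Flatten both statistics into one vector (illustrative layout only)."""
    return list(zero) + [value for row in first for value in row]


# Toy example: 3 frames, 2 tokens (e.g. phoneme posteriors), 2-dim features.
toy_posteriors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
toy_features = [[2.0, 4.0], [1.0, 1.0], [3.0, 0.0]]
toy_zero, toy_first = baum_welch_stats(toy_posteriors, toy_features)
toy_vector = concatenate_stats(toy_zero, toy_first)
```

Because the accumulation only needs frame-aligned posteriors and features, swapping the tokenizer or the feature stream changes the inputs but not the accumulation itself. The subsequent i-vector extraction and PLDA scoring are outside the scope of this sketch.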


