Odyssey 2010: The Speaker and Language Recognition Workshop

An i-vector Extractor Suitable for Speaker Recognition with both Microphone and Telephone Speech



Abstract

It is widely believed that speaker verification systems perform better when sufficient background training data is available to deal with the nuisance effects of transmission channels. It is also known that these systems perform best when the acoustic environment of the training data is similar to that of the context of use (test context). For some applications, however, training data from the same type of acoustic environment is scarce, while a considerable amount of data from a different type of environment is available. In this paper, we propose a new architecture for text-independent speaker verification systems that can be satisfactorily trained with a limited amount of application-specific data, supplemented by a sufficient amount of training data from some other context.

This architecture is based on the extraction of parameters (i-vectors) from a low-dimensional space (total variability space) proposed by Dehak [1]. Our aim is to extend Dehak's work to speaker recognition on sparse data, namely microphone speech. The main challenge is to overcome the fact that insufficient application-specific data is available to accurately estimate the total variability covariance matrix. We propose a method based on Joint Factor Analysis (JFA) to estimate microphone eigenchannels (sparse data) alongside telephone eigenchannels (sufficient data).

For classification, we experimented with two approaches: Support Vector Machines (SVM) and the Cosine Distance Scoring (CDS) classifier, which is based on cosine distances. We present recognition results for the female portion of the interview data of the NIST 2008 SRE. The best performance is obtained when our system is fused with the state-of-the-art JFA system: a 13% relative improvement in equal error rate, with the minimum value of the detection cost function decreasing from 0.0219 to 0.0164.
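The CDS classifier mentioned in the abstract scores a verification trial as the cosine of the angle between the target-speaker i-vector and the test i-vector. A minimal sketch of that scoring rule (the vector dimensions and values below are illustrative, not taken from the paper):

```python
import math

def cds_score(w_target, w_test):
    """Cosine Distance Scoring: the cosine of the angle between the
    target-speaker i-vector and the test i-vector. Higher scores
    indicate the same speaker; a threshold yields accept/reject."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_t = math.sqrt(sum(a * a for a in w_target))
    norm_x = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_t * norm_x)

# Toy 3-dimensional "i-vectors" (real total-variability spaces are
# typically a few hundred dimensions; these values are made up).
target = [0.8, -0.2, 0.1]
same = [0.7, -0.25, 0.12]     # close in direction to the target
different = [-0.1, 0.9, 0.4]  # points elsewhere

assert cds_score(target, same) > cds_score(target, different)
```

Because the score depends only on the angle, not the vector lengths, CDS needs no heavyweight backend model at test time, which is part of its appeal over a full JFA scoring pass.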
