
Robust speaker recognition based on latent variable models.


Abstract

Automatic speaker recognition in uncontrolled environments is a very challenging task due to channel distortions, additive noise and reverberation. To address these issues, this thesis studies probabilistic latent variable models of short-term spectral information that leverage large amounts of data to achieve robustness in challenging conditions.

Current speaker recognition systems represent an entire speech utterance as a single point in a high-dimensional space. This representation is known as a "supervector". This thesis starts by analyzing the properties of this representation. A novel visualization procedure for supervectors is presented, which gives qualitative insight into the information being captured. We then propose the use of an overcomplete dictionary to explicitly decompose a supervector into a speaker-specific component and an undesired variability component. An algorithm to learn the dictionary from a large collection of data is discussed and analyzed. One subset of the dictionary entries is learned to represent speaker-specific information and another subset to represent distortions. After encoding the supervector as a linear combination of the dictionary entries, the undesired variability is removed by discarding the contribution of the distortion components. This paradigm is closely related to the previously proposed paradigm of Joint Factor Analysis modeling of supervectors. We establish a connection between the two approaches and show how our proposed method provides improvements in terms of computation and recognition accuracy.

An alternative way to handle undesired variability in supervector representations is to first project them into a lower-dimensional space and then to model them in the reduced subspace. This low-dimensional projection is known as an "i-vector". Unfortunately, i-vectors exhibit non-Gaussian behavior, and direct statistical modeling requires the use of heavy-tailed distributions for optimal performance. These approaches lack closed-form solutions, and therefore are hard to analyze. Moreover, they do not scale well to large datasets. Instead of directly modeling i-vectors, we propose to first apply a non-linear transformation and then use a linear-Gaussian model. We present two alternative transformations and show experimentally that the transformed i-vectors can be optimally modeled by a simple linear-Gaussian model (factor analysis). We evaluate our method on a benchmark dataset with a large amount of channel variability and show that the results compare favorably with competing methods. In addition, our approach has closed-form solutions and scales gracefully to large datasets.

Finally, a multi-classifier architecture trained in a multicondition fashion is proposed to address the problem of speaker recognition in the presence of additive noise. A large number of experiments are conducted to analyze the proposed architecture and to obtain guidelines for optimal performance in noisy environments. Overall, it is shown that multicondition training of multi-classifier architectures not only produces great robustness in the anticipated conditions, but also generalizes well to unseen conditions.
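The dictionary-based decomposition described in the abstract (encode a supervector over speaker and distortion entries, then discard the distortion contribution) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function name `remove_distortion`, the ridge-regularized least-squares encoder, and the random toy dictionary are all assumptions made for illustration, and the abstract does not specify how the dictionary is learned or how the encoding is computed.

```python
import numpy as np

def remove_distortion(supervector, speaker_atoms, distortion_atoms, ridge=1e-3):
    """Encode a supervector over [speaker_atoms | distortion_atoms] and
    return the part reconstructed from the speaker atoms only.

    Hypothetical helper: the ridge least-squares encoder is a stand-in,
    not the encoding procedure used in the thesis.
    """
    # Stack both subsets of entries into one dictionary (dim x n_atoms).
    dictionary = np.hstack([speaker_atoms, distortion_atoms])
    n_atoms = dictionary.shape[1]
    # Solve (D^T D + ridge * I) a = D^T s for the coefficients a.
    gram = dictionary.T @ dictionary + ridge * np.eye(n_atoms)
    coeffs = np.linalg.solve(gram, dictionary.T @ supervector)
    # Keep only the speaker coefficients; the distortion part is dropped.
    n_spk = speaker_atoms.shape[1]
    return speaker_atoms @ coeffs[:n_spk]

# Toy data (assumed sizes): 25 atoms in 20 dimensions, so the dictionary
# is overcomplete. A real dictionary would be learned from data.
rng = np.random.default_rng(0)
dim, n_spk, n_dist = 20, 15, 10
speaker_atoms = rng.standard_normal((dim, n_spk))
distortion_atoms = rng.standard_normal((dim, n_dist))

speaker_part = speaker_atoms @ rng.standard_normal(n_spk)
distortion_part = distortion_atoms @ rng.standard_normal(n_dist)
cleaned = remove_distortion(speaker_part + distortion_part,
                            speaker_atoms, distortion_atoms)
print(cleaned.shape)
```

The ridge term keeps the linear system well-posed even though the dictionary has more entries than dimensions. With a random dictionary like this one, the two components are not cleanly separated; the sketch only shows the mechanics of encoding and discarding.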

Bibliographic record

  • Author

    Garcia-Romero, Daniel

  • Author affiliation

    University of Maryland, College Park

  • Degree-granting institution: University of Maryland, College Park
  • Subject: Engineering Electronics and Electrical
  • Degree: Ph.D.
  • Year: 2012
  • Pages: 154 p.
  • Total pages: 154
  • Original format: PDF
  • Language: English
