Journal: Computer Speech and Language

Regularization of neural network model with distance metric learning for i-vector based spoken language identification


Abstract

The i-vector representation and modeling technique has been successfully applied to spoken language identification (SLI). The advantage of the i-vector representation is that a speech utterance of any duration can be represented as a fixed-length vector. Because the i-vector representation encodes several types of acoustic variation (e.g., speaker variation and transmission channel variation), a discriminative transform or classifier must be applied in modeling to emphasize the variations correlated with language identity. Owing to its strong nonlinear discriminative power, the neural network model has been used directly to learn the mapping function between the i-vector representation and the language identity labels. In most studies, only point-wise feature-label information is fed to the model for parameter learning, which may result in overfitting, particularly when training data are limited. In this study, we propose to integrate pair-wise distance metric learning as a regularization of model parameter optimization. In the representation space of the nonlinear transforms in the hidden layers, distance metric learning is explicitly designed to minimize the pair-wise intra-class variation and maximize the inter-class variation. With the pair-wise distance metric learning, the i-vectors are transformed into a new feature space in which samples belonging to different languages are much more discriminable, while samples belonging to the same language are much more similar. We tested the algorithm on an SLI task and obtained promising results that outperformed conventional regularization methods.
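The abstract does not give the exact objective, so the following is only a minimal sketch of the general idea under stated assumptions: a point-wise cross-entropy loss on a neural classifier, plus a pair-wise term computed on the hidden-layer representations that pulls same-language i-vectors together and pushes different-language i-vectors apart. The single tanh hidden layer, the contrastive hinge form with a `margin`, averaging over all pairs, and the weight `lam` are all assumptions chosen for illustration, not the authors' formulation; every function name here is hypothetical.

```python
import math
from itertools import combinations


def hidden(x, W, b):
    """One nonlinear hidden layer (assumed): h = tanh(W x + b)."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def linear(h, V, c):
    """Output-layer logits: z = V h + c (one row of V per language)."""
    return [sum(v * hj for v, hj in zip(row, h)) + ci for row, ci in zip(V, c)]


def softmax(z):
    """Numerically stable softmax over the language logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sq_dist(a, b):
    """Squared Euclidean distance between two hidden representations."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def pairwise_metric_loss(H, y, margin=1.0):
    """Assumed contrastive-style pair loss on hidden representations H.

    Same-language pairs contribute their squared distance (intra-class
    variation is minimized); different-language pairs contribute
    max(0, margin - d^2), so they are pushed apart until their squared
    distance reaches `margin` (inter-class variation is maximized).
    """
    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(H)), 2):
        d2 = sq_dist(H[i], H[j])
        total += d2 if y[i] == y[j] else max(0.0, margin - d2)
        n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


def regularized_loss(X, y, params, lam=0.1, margin=1.0):
    """Mean cross-entropy plus lam times the pair-wise metric regularizer.

    X: list of i-vectors, y: integer language labels,
    params: (W, b, V, c) for the hidden and output layers (hypothetical).
    """
    W, b, V, c = params
    H = [hidden(x, W, b) for x in X]
    ce = 0.0
    for h, label in zip(H, y):
        ce -= math.log(softmax(linear(h, V, c))[label])
    ce /= len(X)
    return ce + lam * pairwise_metric_loss(H, y, margin)
```

Setting `lam=0` recovers the plain point-wise cross-entropy objective the abstract contrasts against. The sketch only evaluates the loss; in practice it would be minimized with gradient descent in an automatic-differentiation framework, and the pairs would typically be drawn within mini-batches rather than over the whole training set.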