IEEE International Conference on Acoustics, Speech and Signal Processing

Improving Deep CNN Networks with Long Temporal Context for Text-Independent Speaker Verification



Abstract

Deep CNN networks have shown great success in various tasks for text-independent speaker recognition. In this paper, we explore two approaches to modeling long temporal context to improve the performance of ResNet networks. The first approach simply integrates utterance-level mean and variance normalization into the ResNet architecture. Second, we combine a BLSTM and a ResNet into one unified architecture. The BLSTM layers model long-range, presumably phonetically aware context information, which can help the ResNet learn optimal attention weights and suppress environmental variations. The BLSTM outputs are projected into multi-channel feature maps and fed into the ResNet network. Experiments on the VoxCeleb1 and internal MS-SV tasks show that, with attentive pooling, the proposed approaches achieve a 23-28% relative improvement in EER over a well-trained ResNet.
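The two ingredients the abstract names for the first approach, utterance-level mean/variance normalization and attentive statistics pooling, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; shapes, the single attention parameter vector `w`, and all function names are illustrative assumptions.

```python
import numpy as np

def utterance_mvn(feats, eps=1e-8):
    """Utterance-level mean and variance normalization: each utterance's
    (frames x dims) feature matrix is normalized by its own per-dimension
    mean and standard deviation before entering the network."""
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + eps)

def attentive_pooling(frames, w, eps=1e-8):
    """Attentive statistics pooling (illustrative form): a softmax over
    per-frame scores gives attention weights, which produce a weighted
    mean and standard deviation concatenated into one utterance vector.
    `w` stands in for learned attention parameters (one weight per dim)."""
    scores = frames @ w                       # (T,) per-frame scores
    scores = np.exp(scores - scores.max())    # numerically stable softmax
    alpha = scores / scores.sum()             # attention weights over frames
    mean = alpha @ frames                     # weighted mean, shape (D,)
    var = alpha @ (frames - mean) ** 2        # weighted variance, shape (D,)
    return np.concatenate([mean, np.sqrt(var + eps)])
```

With 40-dimensional features over T frames, `utterance_mvn` yields zero-mean, unit-variance input per dimension, and `attentive_pooling` returns an 80-dimensional utterance-level embedding (weighted mean plus weighted standard deviation).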
