Conference: IEEE International Conference on Acoustics, Speech and Signal Processing

Utterance-level Aggregation for Speaker Recognition in the Wild

Abstract

The objective of this paper is speaker recognition 'in the wild', where utterances may be of variable length and may also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame-level) network and the method of temporal aggregation. We propose a powerful speaker recognition deep network that can be trained end-to-end, using a 'thin-ResNet' trunk architecture and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time. We show that our network achieves state-of-the-art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for 'in the wild' data, a longer length is beneficial.
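
The dictionary-based aggregation mentioned in the abstract can be illustrated with a minimal NumPy sketch of NetVLAD-style pooling with optional ghost clusters. This is not the authors' implementation: the function name `ghostvlad_pool`, the parameter shapes, and the random parameters in the usage lines are assumptions made for illustration, and in a real network the centers, weights, and biases would be learned end-to-end on top of the trunk network's frame-level features.

```python
import numpy as np

def ghostvlad_pool(frames, centers, weights, biases, num_ghost=0):
    """Aggregate T frame-level features into one fixed-length vector.

    frames:  (T, D) frame-level features; T may differ per utterance
    centers: (K + G, D) cluster centers (hypothetical, normally learned)
    weights: (D, K + G) and biases: (K + G,) for the soft assignment
    The last `num_ghost` clusters take part in the soft assignment
    but are dropped from the output (the "ghost" clusters).
    """
    # Soft-assign every frame to every cluster with a stable softmax.
    logits = frames @ weights + biases                  # (T, K + G)
    logits -= logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)

    # Sum of assignment-weighted residuals: sum_t a_tk * (x_t - c_k).
    vlad = assign.T @ frames - assign.sum(axis=0)[:, None] * centers

    # Discard ghost clusters so frames assigned to them do not contribute.
    if num_ghost:
        vlad = vlad[:-num_ghost]

    # L2-normalize each cluster residual, then the flattened vector.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    flat = vlad.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)

# Usage with random stand-in parameters: utterances of different
# lengths map to vectors of the same size, K * D.
rng = np.random.default_rng(0)
D, K, G = 16, 4, 2
centers = rng.normal(size=(K + G, D))
weights = rng.normal(size=(D, K + G))
biases = np.zeros(K + G)
short = ghostvlad_pool(rng.normal(size=(50, D)), centers, weights, biases, num_ghost=G)
long_ = ghostvlad_pool(rng.normal(size=(300, D)), centers, weights, biases, num_ghost=G)
print(short.shape, long_.shape)
```

Because the output size depends only on the number of retained clusters and the feature dimension, the same layer can handle utterances of any length, which is the property the abstract relies on for 'in the wild' data.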
