IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self-Supervised Learning for Audio-Visual Speaker Diarization

Abstract

Speaker diarization, the task of finding the speech segments of specific speakers, has been widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that addresses speaker diarization without a massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We evaluate them on a real-world human-computer interaction system, and the results show that our best model yields a remarkable gain of +8% in F1-score as well as a reduction in diarization error rate. Finally, we introduce a new large-scale audio-video corpus designed to fill the vacancy of audio-video datasets in Chinese.
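
The abstract names two loss functions but does not spell them out. As a rough illustration only, the sketch below shows one plausible reading: a triplet-style loss that pulls an audio embedding toward its synchronized face-track embedding and away from a misaligned one, and a multinomial (N-way softmax) loss over candidate video clips. All function names, tensor shapes, and the cosine-distance and softmax formulations here are assumptions, not the paper's actual method.

```python
# A minimal, illustrative sketch (PyTorch) of the two kinds of losses the
# abstract names. The paper's exact "dynamic triplet loss" and "multinomial
# loss" are NOT specified here; the shapes, cosine distance, and N-way
# softmax reading below are all assumptions made for illustration.
import torch
import torch.nn.functional as F

def triplet_sync_loss(audio_emb, video_pos, video_neg, margin=0.2):
    # Pull the audio embedding toward the temporally aligned face-track
    # embedding and push it away from a misaligned one, via a hinge on
    # the difference of cosine distances. Inputs: (batch, dim), L2-normalized.
    d_pos = 1.0 - F.cosine_similarity(audio_emb, video_pos)
    d_neg = 1.0 - F.cosine_similarity(audio_emb, video_neg)
    return F.relu(d_pos - d_neg + margin).mean()

def multinomial_sync_loss(audio_emb, video_cands, target_idx):
    # Treat synchronization as N-way classification: softmax cross-entropy
    # over dot products between the audio embedding and N candidate video
    # embeddings. video_cands: (batch, n_cands, dim); target_idx: (batch,).
    logits = torch.einsum('bd,bnd->bn', audio_emb, video_cands)
    return F.cross_entropy(logits, target_idx)

# Toy usage with random embeddings:
a  = F.normalize(torch.randn(8, 128), dim=1)
vp = F.normalize(torch.randn(8, 128), dim=1)
vn = F.normalize(torch.randn(8, 128), dim=1)
cands = F.normalize(torch.randn(8, 5, 128), dim=2)
tgt = torch.randint(0, 5, (8,))
print(triplet_sync_loss(a, vp, vn), multinomial_sync_loss(a, cands, tgt))
```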