IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self-Supervised Learning for Audio-Visual Speaker Diarization

Abstract

Speaker diarization, the task of finding the speech segments of specific speakers, has been widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that addresses speaker diarization without a massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We evaluate them on a real-world human-computer interaction system, and the results show that our best model yields a remarkable gain of +8% in F1-score as well as a reduction in diarization error rate. Finally, we introduce a new large-scale audio-video corpus designed to fill the vacancy of audio-video datasets in Chinese.
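
The abstract names two loss functions but does not spell them out. As a rough illustration only, the sketch below shows one plausible reading: a triplet-style loss that pulls an audio embedding toward its synchronized face-track embedding and away from a misaligned one, and a multinomial (N-way softmax) loss over candidate video clips. All function names, tensor shapes, and the cosine-distance and softmax formulations here are assumptions, not the paper's actual method.

```python
# A minimal, illustrative sketch (PyTorch) of the two kinds of losses the
# abstract names. The paper's exact "dynamic triplet loss" and "multinomial
# loss" are NOT specified here; the shapes, cosine distance, and N-way
# softmax reading below are all assumptions made for illustration.
import torch
import torch.nn.functional as F

def triplet_sync_loss(audio_emb, video_pos, video_neg, margin=0.2):
    # Pull the audio embedding toward the temporally aligned face-track
    # embedding and push it away from a misaligned one, via a hinge on
    # the difference of cosine distances. Inputs: (batch, dim), L2-normalized.
    d_pos = 1.0 - F.cosine_similarity(audio_emb, video_pos)
    d_neg = 1.0 - F.cosine_similarity(audio_emb, video_neg)
    return F.relu(d_pos - d_neg + margin).mean()

def multinomial_sync_loss(audio_emb, video_cands, target_idx):
    # Treat synchronization as N-way classification: softmax cross-entropy
    # over dot products between the audio embedding and N candidate video
    # embeddings. video_cands: (batch, n_cands, dim); target_idx: (batch,).
    logits = torch.einsum('bd,bnd->bn', audio_emb, video_cands)
    return F.cross_entropy(logits, target_idx)

# Toy usage with random embeddings:
a  = F.normalize(torch.randn(8, 128), dim=1)
vp = F.normalize(torch.randn(8, 128), dim=1)
vn = F.normalize(torch.randn(8, 128), dim=1)
cands = F.normalize(torch.randn(8, 5, 128), dim=2)
tgt = torch.randint(0, 5, (8,))
print(triplet_sync_loss(a, vp, vn), multinomial_sync_loss(a, cands, tgt))
```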