Transfer Learning Using Raw Waveform Sincnet for Robust Speaker Diarization

Abstract

Speaker diarization determines "who spoke and when" in an audio stream. SincNet is a recently developed convolutional neural network (CNN) architecture whose first layer consists of parameterized sinc filters. Unlike conventional CNNs, SincNet takes the raw speech waveform as input. This paper leverages SincNet in a vanilla transfer learning (VTL) setup. Out-of-domain data is used to train SincNet-VTL to perform frame-level speaker classification. The trained SincNet-VTL is later used as a feature extractor for in-domain data. We investigated pooling strategies (max, avg) for deriving utterance-level embeddings from frame-level features extracted from the trained network. These utterance/segment-level embeddings are adopted as speaker models during the clustering stage of the diarization pipeline. We compared the proposed SincNet-VTL embeddings with baseline i-vector features. We evaluated our approaches on two corpora, CRSS-PLTL and AMI. Results show the efficacy of the trained SincNet-VTL for speaker-discriminative embeddings even when trained on a small amount of data. The proposed features achieved relative diarization error rate (DER) improvements of 19.12% and 52.07% on the CRSS-PLTL and AMI data, respectively, over baseline i-vectors.
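The pooling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the frame-level features for one segment are already available as a NumPy array of shape (num_frames, feat_dim); the function name `pool_frames` and the toy values are hypothetical and are not taken from the paper, which does not specify its implementation here.

```python
import numpy as np


def pool_frames(frame_features: np.ndarray, strategy: str = "avg") -> np.ndarray:
    """Collapse (num_frames, feat_dim) frame-level features into one
    (feat_dim,) segment-level embedding using avg or max pooling."""
    if frame_features.ndim != 2 or frame_features.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, feat_dim) array")
    if strategy == "avg":
        # Average over the time (frame) axis.
        return frame_features.mean(axis=0)
    if strategy == "max":
        # Element-wise maximum over the time (frame) axis.
        return frame_features.max(axis=0)
    raise ValueError(f"unknown pooling strategy: {strategy!r}")


# Toy input: 2 frames with 2-dimensional features (illustrative values only).
toy = np.array([[1.0, 4.0],
                [3.0, 2.0]])
avg_embedding = pool_frames(toy, "avg")
max_embedding = pool_frames(toy, "max")
```

Either pooled vector could then serve as the per-segment speaker representation passed to a clustering stage; which strategy works better is an empirical question that the abstract reports investigating.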

Bibliographic details

Source: IEEE International Conference on Acoustics, Speech and Signal Processing | 2019 | pp. 5996-6664 | 5 pages
Authors: Harishchandra Dubey; Abhijeet Sangwan; John H. L. Hansen
Original format: PDF
Chinese Library Classification: TN912-53
Keywords: Speaker Clustering; SincNet; Audio Diarization; Peer-led team learning; Transfer Learning

Similar literature

1. Wordless Sounds: Robust Speaker Diarization Using Privacy-Preserving Audio Representations [J]. Parthasarathi S. H. K., Bourlard H., Gatica-Perez D. IEEE Transactions on Audio, Speech, and Language Processing. 2013, No. 1
2. Harmonic Structure Features for Robust Speaker Diarization [J]. Yu Zhou, Hongbin Suo, Junfeng Li. ETRI Journal. 2012, No. 4
3. Transfer Learning Using Raw Waveform Sincnet for Robust Speaker Diarization [C]. Harishchandra Dubey, Abhijeet Sangwan, John H. L. Hansen. IEEE International Conference on Acoustics, Speech and Signal Processing. 2019
4. Automatic Speaker Recognition and Diarization in Co-Channel Speech [D]. Shokouhi, Navid. 2017
5. Supervised Speaker Diarization Using Random Forests: A Tool for Psychotherapy Process Research [O]. Lukas Fürer, Nathalie Schenk, Volker Roth. 2020
6. Speaker Recognition from Raw Waveform with SincNet [O]. Mirco Ravanelli, Yoshua Bengio. 2018
7. Robust Speech Processing & Recognition: Speaker ID, Language ID, Speech Recognition/Keyword Spotting, Diarization/Co-Channel/Environmental Characterization, Speaker State Assessment [R]. Hansen, J. H. 2015