IEEE International Conference on Acoustics, Speech and Signal Processing

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision


Abstract

The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity and ultimately producing speaker identity representations that are more robust. We train our method on a large-scale audio-visual dataset of talking heads 'in the wild', and demonstrate its efficacy by evaluating the learned speaker representations for standard speaker recognition performance.
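The abstract outlines a two-stream design: a shared low-level trunk, a content head that stays time-aligned (so it can be synchronised with the face), and an identity head that pools over time into a single speaker embedding, trained with a cross-modal objective that pairs audio with the face track from the same video. Below is a minimal sketch of that idea, assuming PyTorch; the layer sizes, the InfoNCE-style synchrony loss, and all names are illustrative stand-ins, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAudioEncoder(nn.Module):
    """Shared low-level trunk followed by two heads: one for linguistic
    content, one for speaker identity (illustrative sizes)."""
    def __init__(self, n_mels=40, dim=256):
        super().__init__()
        # shared low-level features common to both representations
        self.trunk = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        # content head: keeps temporal resolution for synchrony with the face
        self.content_head = nn.Conv1d(128, dim, kernel_size=3, padding=1)
        # identity head: pools over time into a single speaker embedding
        self.identity_head = nn.Sequential(
            nn.Conv1d(128, dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, mel):                           # mel: (B, n_mels, T)
        h = self.trunk(mel)
        content = self.content_head(h)                # (B, dim, T)
        identity = self.identity_head(h).squeeze(-1)  # (B, dim)
        return content, identity


def cross_modal_nce(audio_emb, face_emb, temperature=0.07):
    """Contrastive (InfoNCE-style) loss: each audio clip is paired with the
    face embedding from the same video; other items in the batch act as
    negatives."""
    a = F.normalize(audio_emb, dim=-1)
    f = F.normalize(face_emb, dim=-1)
    logits = a @ f.t() / temperature                  # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    enc = TwoStreamAudioEncoder()
    mel = torch.randn(8, 40, 100)                     # batch of log-mel clips
    content, identity = enc(mel)
    face_identity = torch.randn(8, 256)               # stand-in face embeddings
    loss = cross_modal_nce(identity, face_identity)
    print(content.shape, identity.shape, loss.item())
```

In this sketch the disentanglement is structural: only the identity embedding is matched against the face at the video level, while the time-aligned content embedding can be trained with a finer-grained synchrony signal, so speaker information is encouraged to flow into one head and linguistic content into the other.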

