IEEE International Conference on Acoustics, Speech and Signal Processing

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision



Abstract

The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity and ultimately producing speaker identity representations that are more robust. We train our method on a large-scale audio-visual dataset of talking heads 'in the wild', and demonstrate its efficacy by evaluating the learned speaker representations for standard speaker recognition performance.
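The abstract names two ingredients: a trunk of low-level features shared by both representations, and a self-supervised synchrony objective between the face and audio streams that separates linguistic content (per time step) from speaker identity (per clip). The sketch below illustrates these ideas in PyTorch. All names, layer sizes, and the specific contrastive formulation (`TwoStreamAudioEncoder`, `sync_contrastive_loss`, the temperature value) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAudioEncoder(nn.Module):
    """Shared low-level trunk with separate content and identity heads
    (hypothetical layer sizes; illustrative only)."""
    def __init__(self, dim=256):
        super().__init__()
        # Low-level features shared by both representations
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # pool mel axis, keep time
        )
        # Content head: one embedding per time step (what is being said)
        self.content_head = nn.Conv1d(128, dim, kernel_size=1)
        # Identity head: one embedding per clip (who is speaking)
        self.identity_head = nn.Linear(128, dim)

    def forward(self, spec):                    # spec: (B, 1, n_mels, T)
        h = self.trunk(spec).squeeze(2)         # (B, 128, T)
        content = self.content_head(h)          # (B, dim, T)
        identity = self.identity_head(h.mean(dim=2))  # (B, dim)
        return content, identity

def sync_contrastive_loss(audio_content, face_content, temp=0.07):
    """Cross-modal synchrony objective: audio and face features from the
    same time step are positives; misaligned steps within the clip are
    negatives (a standard contrastive formulation, assumed here)."""
    a = F.normalize(audio_content, dim=1)       # (B, dim, T)
    v = F.normalize(face_content, dim=1)        # (B, dim, T)
    logits = torch.einsum('bdt,bds->bts', a, v) / temp   # (B, T, T)
    t = logits.size(1)
    target = torch.arange(t, device=logits.device).repeat(logits.size(0))
    return F.cross_entropy(logits.flatten(0, 1), target)
```

In this sketch only the content head receives a gradient tied to what is said at each instant, while the identity head pools over time, which is one way the two factors can be pushed apart. A clip-level identity loss and the face encoder itself are omitted for brevity; the paper's actual architecture and losses are in the full text.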

