IEEE International Conference on Acoustics, Speech and Signal Processing

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision


Abstract

The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity and ultimately producing speaker identity representations that are more robust. We train our method on a large-scale audio-visual dataset of talking heads 'in the wild', and demonstrate its efficacy by evaluating the learned speaker representations for standard speaker recognition performance.
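The abstract outlines a two-stream design: a shared low-level trunk, a content head that stays time-aligned (so it can be synchronised with the face), and an identity head that pools over time into a single speaker embedding, trained with a cross-modal objective that pairs audio with the face track from the same video. Below is a minimal sketch of that idea, assuming PyTorch; the layer sizes, the InfoNCE-style synchrony loss, and all names are illustrative stand-ins, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAudioEncoder(nn.Module):
    """Shared low-level trunk followed by two heads: one for linguistic
    content, one for speaker identity (illustrative sizes)."""
    def __init__(self, n_mels=40, dim=256):
        super().__init__()
        # shared low-level features common to both representations
        self.trunk = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        # content head: keeps temporal resolution for synchrony with the face
        self.content_head = nn.Conv1d(128, dim, kernel_size=3, padding=1)
        # identity head: pools over time into a single speaker embedding
        self.identity_head = nn.Sequential(
            nn.Conv1d(128, dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, mel):                           # mel: (B, n_mels, T)
        h = self.trunk(mel)
        content = self.content_head(h)                # (B, dim, T)
        identity = self.identity_head(h).squeeze(-1)  # (B, dim)
        return content, identity


def cross_modal_nce(audio_emb, face_emb, temperature=0.07):
    """Contrastive (InfoNCE-style) loss: each audio clip is paired with the
    face embedding from the same video; other items in the batch act as
    negatives."""
    a = F.normalize(audio_emb, dim=-1)
    f = F.normalize(face_emb, dim=-1)
    logits = a @ f.t() / temperature                  # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    enc = TwoStreamAudioEncoder()
    mel = torch.randn(8, 40, 100)                     # batch of log-mel clips
    content, identity = enc(mel)
    face_identity = torch.randn(8, 256)               # stand-in face embeddings
    loss = cross_modal_nce(identity, face_identity)
    print(content.shape, identity.shape, loss.item())
```

In this sketch the disentanglement is structural: only the identity embedding is matched against the face at the video level, while the time-aligned content embedding can be trained with a finer-grained synchrony signal, so speaker information is encouraged to flow into one head and linguistic content into the other.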

