IEEE International Conference on Acoustics, Speech and Signal Processing

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision



Abstract

The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity and ultimately producing speaker identity representations that are more robust. We train our method on a large-scale audio-visual dataset of talking heads 'in the wild', and demonstrate its efficacy by evaluating the learned speaker representations for standard speaker recognition performance.
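The abstract names two ingredients: a trunk of low-level features shared by both representations, and a self-supervised synchrony objective between the face and audio streams that separates linguistic content (per time step) from speaker identity (per clip). The sketch below illustrates these ideas in PyTorch. All names, layer sizes, and the specific contrastive formulation (`TwoStreamAudioEncoder`, `sync_contrastive_loss`, the temperature value) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAudioEncoder(nn.Module):
    """Shared low-level trunk with separate content and identity heads
    (hypothetical layer sizes; illustrative only)."""
    def __init__(self, dim=256):
        super().__init__()
        # Low-level features shared by both representations
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # pool mel axis, keep time
        )
        # Content head: one embedding per time step (what is being said)
        self.content_head = nn.Conv1d(128, dim, kernel_size=1)
        # Identity head: one embedding per clip (who is speaking)
        self.identity_head = nn.Linear(128, dim)

    def forward(self, spec):                    # spec: (B, 1, n_mels, T)
        h = self.trunk(spec).squeeze(2)         # (B, 128, T)
        content = self.content_head(h)          # (B, dim, T)
        identity = self.identity_head(h.mean(dim=2))  # (B, dim)
        return content, identity

def sync_contrastive_loss(audio_content, face_content, temp=0.07):
    """Cross-modal synchrony objective: audio and face features from the
    same time step are positives; misaligned steps within the clip are
    negatives (a standard contrastive formulation, assumed here)."""
    a = F.normalize(audio_content, dim=1)       # (B, dim, T)
    v = F.normalize(face_content, dim=1)        # (B, dim, T)
    logits = torch.einsum('bdt,bds->bts', a, v) / temp   # (B, T, T)
    t = logits.size(1)
    target = torch.arange(t, device=logits.device).repeat(logits.size(0))
    return F.cross_entropy(logits.flatten(0, 1), target)
```

In this sketch only the content head receives a gradient tied to what is said at each instant, while the identity head pools over time, which is one way the two factors can be pushed apart. A clip-level identity loss and the face encoder itself are omitted for brevity; the paper's actual architecture and losses are in the full text.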

