Visually Guided Self Supervised Learning of Speech Representations



Abstract

Self-supervised representation learning has recently attracted considerable research interest for both the audio and visual modalities. However, most works focus on a single modality or feature, and very little work has studied the interaction between the two modalities for learning self-supervised representations. We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. Through this process, the audio encoder network learns useful speech representations that we evaluate on emotion recognition and speech recognition. We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition. This demonstrates the potential of visual supervision for learning audio representations as a novel approach to self-supervised learning that has not been explored in the past. The proposed unsupervised audio features can leverage a virtually unlimited amount of unlabelled audiovisual speech as training data and have a large number of potentially promising applications.
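
The audio-to-video training scheme described in the abstract can be illustrated with a toy forward pass. This is only a minimal sketch under stated assumptions: the linear layers, the toy dimensions, the L1 reconstruction loss, and all names (`encode_audio`, `generate_video`, `l1_loss`) are hypothetical and are not taken from the paper, whose abstract does not specify the architecture or the losses used.

```python
import numpy as np

# Toy sizes chosen for illustration only (assumptions, not from the paper).
FRAMES, AUDIO_DIM, EMB_DIM, PIXELS = 5, 16, 8, 12

def encode_audio(audio, w_enc):
    """Audio encoder: one embedding per video frame, shape (FRAMES, EMB_DIM)."""
    return np.tanh(audio @ w_enc)

def generate_video(still, emb, w_dec):
    """Animate a still image: each frame is the still plus an audio-driven offset."""
    return still[None, :] + emb @ w_dec

def l1_loss(generated, real):
    """Mean absolute difference between generated and real frames."""
    return float(np.mean(np.abs(generated - real)))

rng = np.random.default_rng(0)
audio = rng.normal(size=(FRAMES, AUDIO_DIM))    # one audio window per frame
real_video = rng.normal(size=(FRAMES, PIXELS))  # flattened real frames
still = real_video[0]                           # still image to animate

# Randomly initialised weights stand in for the trainable networks.
w_enc = rng.normal(scale=0.1, size=(AUDIO_DIM, EMB_DIM))
w_dec = rng.normal(scale=0.1, size=(EMB_DIM, PIXELS))

emb = encode_audio(audio, w_enc)                # candidate speech representation
fake_video = generate_video(still, emb, w_dec)  # generated frames
loss = l1_loss(fake_video, real_video)          # quantity a training loop would minimize
print(fake_video.shape, loss)
```

In a setup like this, only the audio encoder would be kept after training, and its embeddings would feed a separate downstream model for tasks such as emotion recognition or speech recognition.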


Bibliographic details

Source
IEEE International Conference on Acoustics, Speech and Signal Processing | 2020 | pp. 6204-6823 | 5 pages
Authors
Abhinav Shukla; Konstantinos Vougioukas; Pingchuan Ma; Stavros Petridis; Maja Pantic;
Original format: PDF
Chinese Library Classification: TN912-53
Keywords
Self supervised learning; Representation learning; Generative modeling; Audiovisual speech; Cross-modal Supervision;


