ACM Transactions on Graphics

Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation



Abstract

We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).
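The abstract describes the core mechanism at a high level: per-frame visual features of a chosen face are fused with features of the noisy audio, and the network predicts a time-frequency mask that, applied to the mixture spectrogram, isolates that speaker. The toy NumPy sketch below illustrates only this mask-based conditioning idea; it is not the paper's actual architecture (which uses learned convolutional and recurrent layers and complex ratio masks), and all weight matrices and function names here are hypothetical illustrations.

```python
import numpy as np


def separate(mix_spec, visual_emb, W_a, W_v, W_m):
    """Toy sketch of audio-visual mask-based separation.

    mix_spec:   (T, F) nonnegative magnitude spectrogram of the mixture
    visual_emb: (T, D) per-frame embedding of the target speaker's face
    W_a, W_v:   hypothetical projection weights for audio / visual streams
    W_m:        hypothetical weights mapping fused features to a mask
    """
    a = np.tanh(mix_spec @ W_a)              # audio features, (T, H)
    v = np.tanh(visual_emb @ W_v)            # visual features, (T, H)
    fused = np.concatenate([a, v], axis=-1)  # joint features, (T, 2H)
    # Sigmoid keeps the mask in [0, 1], one gain per time-frequency bin.
    mask = 1.0 / (1.0 + np.exp(-(fused @ W_m)))  # (T, F)
    return mask * mix_spec                   # masked (separated) spectrogram


# Usage with random stand-in weights (an untrained model, for shape checks only):
rng = np.random.default_rng(0)
T, F, D, H = 5, 8, 4, 6
mix = np.abs(rng.standard_normal((T, F)))
vis = rng.standard_normal((T, D))
out = separate(
    mix, vis,
    rng.standard_normal((F, H)),
    rng.standard_normal((D, H)),
    rng.standard_normal((2 * H, F)),
)
```

Because the mask is bounded in [0, 1], the separated magnitude at every time-frequency bin can never exceed the mixture's, which is the basic soft-masking property the model relies on.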


