Information Fusion

CochleaNet: A robust language-independent audio-visual model for real-time speech enhancement


Abstract

Noisy situations cause huge problems for the hearing-impaired, as hearing aids often make speech more audible but do not always restore intelligibility. In noisy settings, humans routinely exploit the audio-visual (AV) nature of speech to selectively suppress background noise and focus on the target speaker. In this paper, we present a novel language-, noise- and speaker-independent AV deep neural network (DNN) architecture, termed CochleaNet, for causal or real-time speech enhancement (SE). The model jointly exploits noisy acoustic cues and noise-robust visual cues to focus on the desired speaker and improve speech intelligibility. The proposed SE framework is evaluated using a first-of-its-kind AV binaural speech corpus, ASPIRE, recorded in real noisy environments, including cafeteria and restaurant settings. We demonstrate the superior performance of our approach, in terms of both objective measures and subjective listening tests, over state-of-the-art SE approaches, including recent DNN-based SE models. In addition, our work challenges the popular belief that the scarcity of a multilingual, large-vocabulary AV corpus and of a wide variety of noises is a major bottleneck in building robust language-, speaker- and noise-independent SE systems. We show that a model trained on a synthetic mixture of the benchmark GRID corpus (with 33 speakers and a small English vocabulary) and CHiME 3 noises (comprising bus, pedestrian, cafeteria, and street noises) generalises well, not only to large-vocabulary corpora with a wide variety of speakers and noises, but also to completely unrelated languages such as Mandarin.
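To make the abstract's central idea concrete, below is a minimal sketch of one way audio-visual fusion for mask-based speech enhancement can be wired up. The class name (AVMaskNet), layer types, feature dimensions, and the assumption that the model predicts a time-frequency mask over the noisy spectrogram are illustrative guesses, not CochleaNet's published architecture; unidirectional LSTMs are used only to reflect the causal, real-time constraint the abstract mentions.

```python
# Hypothetical sketch of an audio-visual mask-estimation network for
# speech enhancement, in the spirit of (but not identical to) CochleaNet.
import torch
import torch.nn as nn

class AVMaskNet(nn.Module):
    def __init__(self, n_freq=257, lip_dim=512, hidden=256):
        super().__init__()
        # Audio branch: encode frames of the noisy magnitude spectrogram.
        # Unidirectional LSTM -> no future context, so the model is causal.
        self.audio_enc = nn.LSTM(n_freq, hidden, batch_first=True)
        # Visual branch: encode per-frame lip-region embeddings
        # (assumed precomputed, e.g. by a CNN over cropped lip images,
        # and upsampled to the audio frame rate).
        self.visual_enc = nn.LSTM(lip_dim, hidden, batch_first=True)
        # Fusion head: concatenate both streams and predict a
        # time-frequency mask in [0, 1] for each spectrogram bin.
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_freq),
            nn.Sigmoid(),
        )

    def forward(self, noisy_mag, lip_feats):
        # noisy_mag: (batch, frames, n_freq); lip_feats: (batch, frames, lip_dim)
        a, _ = self.audio_enc(noisy_mag)
        v, _ = self.visual_enc(lip_feats)
        mask = self.mask_head(torch.cat([a, v], dim=-1))
        # Enhanced magnitude: apply the estimated mask to the noisy input.
        return mask * noisy_mag

# Example: one utterance of 100 STFT frames.
net = AVMaskNet()
noisy = torch.rand(1, 100, 257)
lips = torch.rand(1, 100, 512)
enhanced = net(noisy, lips)
print(enhanced.shape)  # torch.Size([1, 100, 257])
```

The masked magnitude would then be recombined with the noisy phase and inverted via an inverse STFT to obtain the enhanced waveform; training such a model on synthetic GRID + CHiME 3 mixtures, as the abstract describes, only requires pairing clean targets with the noisy inputs.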
