Information Fusion

CochleaNet: A robust language-independent audio-visual model for real-time speech enhancement


Abstract

Noisy situations cause huge problems for the hearing-impaired, as hearing aids often make speech more audible but do not always restore intelligibility. In noisy settings, humans routinely exploit the audio-visual (AV) nature of speech to selectively suppress background noise and focus on the target speaker. In this paper, we present a novel language-, noise- and speaker-independent AV deep neural network (DNN) architecture, termed CochleaNet, for causal or real-time speech enhancement (SE). The model jointly exploits noisy acoustic cues and noise-robust visual cues to focus on the desired speaker and improve speech intelligibility. The proposed SE framework is evaluated using a first-of-its-kind AV binaural speech corpus, ASPIRE, recorded in real noisy environments, including cafeteria and restaurant settings. We demonstrate the superior performance of our approach, in terms of both objective measures and subjective listening tests, over state-of-the-art SE approaches, including recent DNN-based SE models. In addition, our work challenges the popular belief that the scarcity of a multilingual, large-vocabulary AV corpus and of a wide variety of noises is a major bottleneck in building robust language-, speaker- and noise-independent SE systems. We show that a model trained on a synthetic mixture of the benchmark GRID corpus (with 33 speakers and a small English vocabulary) and CHiME 3 noises (comprising bus, pedestrian, cafeteria, and street noises) generalises well, not only to large-vocabulary corpora with a wide variety of speakers and noises, but also to completely unrelated languages such as Mandarin.
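To make the abstract's central idea concrete, below is a minimal sketch of one way audio-visual fusion for mask-based speech enhancement can be wired up. The class name (AVMaskNet), layer types, feature dimensions, and the assumption that the model predicts a time-frequency mask over the noisy spectrogram are illustrative guesses, not CochleaNet's published architecture; unidirectional LSTMs are used only to reflect the causal, real-time constraint the abstract mentions.

```python
# Hypothetical sketch of an audio-visual mask-estimation network for
# speech enhancement, in the spirit of (but not identical to) CochleaNet.
import torch
import torch.nn as nn

class AVMaskNet(nn.Module):
    def __init__(self, n_freq=257, lip_dim=512, hidden=256):
        super().__init__()
        # Audio branch: encode frames of the noisy magnitude spectrogram.
        # Unidirectional LSTM -> no future context, so the model is causal.
        self.audio_enc = nn.LSTM(n_freq, hidden, batch_first=True)
        # Visual branch: encode per-frame lip-region embeddings
        # (assumed precomputed, e.g. by a CNN over cropped lip images,
        # and upsampled to the audio frame rate).
        self.visual_enc = nn.LSTM(lip_dim, hidden, batch_first=True)
        # Fusion head: concatenate both streams and predict a
        # time-frequency mask in [0, 1] for each spectrogram bin.
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_freq),
            nn.Sigmoid(),
        )

    def forward(self, noisy_mag, lip_feats):
        # noisy_mag: (batch, frames, n_freq); lip_feats: (batch, frames, lip_dim)
        a, _ = self.audio_enc(noisy_mag)
        v, _ = self.visual_enc(lip_feats)
        mask = self.mask_head(torch.cat([a, v], dim=-1))
        # Enhanced magnitude: apply the estimated mask to the noisy input.
        return mask * noisy_mag

# Example: one utterance of 100 STFT frames.
net = AVMaskNet()
noisy = torch.rand(1, 100, 257)
lips = torch.rand(1, 100, 512)
enhanced = net(noisy, lips)
print(enhanced.shape)  # torch.Size([1, 100, 257])
```

The masked magnitude would then be recombined with the noisy phase and inverted via an inverse STFT to obtain the enhanced waveform; training such a model on synthetic GRID + CHiME 3 mixtures, as the abstract describes, only requires pairing clean targets with the noisy inputs.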
