Journal: IEEE Journal of Selected Topics in Signal Processing

Multi-Modal Multi-Channel Target Speech Separation



Abstract

Target speech separation refers to extracting a target speaker's voice from the overlapped audio of simultaneous talkers. Prior work using the visual modality for target speech separation has demonstrated great potential. This work proposes a general multi-modal framework for target speech separation that exploits all the available information about the target speaker, including his/her spatial location, voice characteristics, and lip movements. Under this framework, we also investigate fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multiple modalities at the embedding level. This method first factorizes the mixture audio into a set of acoustic subspaces, then leverages the target's information from the other modalities to enhance these subspace acoustic embeddings with a learnable attention scheme. To validate the robustness of the proposed multi-modal separation model in practical scenarios, the system was evaluated under conditions where one of the modalities is temporarily missing, invalid, or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released), spatialized by simulated room impulse responses (RIRs). Experimental results show that the proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches, while still supporting real-time processing.
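The factorized attention step can be illustrated with a minimal sketch. The abstract does not specify shapes or projections, so everything below is an assumption for illustration: the helper name `factorized_attention_fusion`, the channel-split factorization, and the random stand-in weights `W` are hypothetical; in the actual model the factorization and attention parameters would be learned end-to-end.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def factorized_attention_fusion(mix_emb, target_embs, num_subspaces, rng):
    """Sketch of factorized attention-based fusion (hypothetical shapes).

    mix_emb:     (T, D) frame-level acoustic embedding of the mixture
    target_embs: list of (E_i,) target embeddings from other modalities
                 (e.g. spatial direction, voiceprint, lip movements)
    """
    T, D = mix_emb.shape
    assert D % num_subspaces == 0
    d = D // num_subspaces
    # 1) Factorize the mixture embedding into K acoustic subspaces
    #    (here a simple channel split; the paper's factorization is learned).
    sub = mix_emb.reshape(T, num_subspaces, d)           # (T, K, d)
    # 2) Build a query from the concatenated target-modality embeddings.
    #    W is a random stand-in for a learned projection.
    cond = np.concatenate(target_embs)                   # (E_total,)
    W = rng.standard_normal((cond.size, d)) * 0.01       # hypothetical learned weights
    query = cond @ W                                     # (d,)
    scores = sub @ query                                 # (T, K) subspace scores
    attn = softmax(scores, axis=-1)                      # attention over subspaces
    # 3) Enhance subspace embeddings by the attention weights and recombine.
    enhanced = sub * attn[..., None]                     # (T, K, d)
    return enhanced.reshape(T, D)                        # fused embedding

# Toy usage: 100 frames of a 256-dim mixture embedding, two modality embeddings.
rng = np.random.default_rng(0)
mix = rng.standard_normal((100, 256))
mods = [rng.standard_normal(32), rng.standard_normal(64)]
fused = factorized_attention_fusion(mix, mods, num_subspaces=8, rng=rng)
```

The key design point the sketch tries to capture is that attention is computed over acoustic subspaces rather than over time, so the target's cross-modal information selects which parts of the acoustic representation to emphasize.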
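The spatialization of the dataset mentioned above amounts to convolving each dry source with per-microphone simulated RIRs and summing the resulting multi-channel signals. A minimal sketch follows, assuming single-channel sources and precomputed RIRs; the decaying-noise RIRs and random signals here are toy stand-ins for a room simulator's output and real speech, and `spatialize` is a hypothetical helper name.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(clean, rirs):
    """Convolve a single-channel source with per-microphone RIRs.

    clean: (N,) dry source signal
    rirs:  (M, L) simulated room impulse responses, one per microphone
    Returns an (M, N) multi-channel spatialized signal.
    """
    return np.stack([fftconvolve(clean, h)[: len(clean)] for h in rirs])

# Mix two spatialized speakers to simulate overlapped multi-channel speech.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)                                   # stand-in for speaker 1
s2 = rng.standard_normal(16000)                                   # stand-in for speaker 2
decay = np.exp(-np.arange(512) / 80.0)                            # toy reverberant decay
rirs1 = rng.standard_normal((4, 512)) * decay                     # 4-mic RIRs, position 1
rirs2 = rng.standard_normal((4, 512)) * decay                     # 4-mic RIRs, position 2
mixture = spatialize(s1, rirs1) + spatialize(s2, rirs2)           # (4, 16000) mixture
```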
