Multimodal Attention Fusion for Target Speaker Extraction

Abstract

Target speaker extraction, which aims at extracting a target speaker's voice from a mixture of voices using audio, visual or locational clues, has received much interest. Recently, audio-visual target speaker extraction has been proposed, which extracts target speech by using complementary audio and visual clues. Although audio-visual target speaker extraction offers more stable performance than single-modality methods on simulated data, neither its adaptation to realistic situations nor its evaluation on real recorded mixtures has been fully explored. One of the major issues in handling realistic situations is how to make the system robust to clue corruption, because in real recordings the two clues may not be equally reliable, e.g. visual clues may be affected by occlusions. In this work, we propose a novel attention mechanism for multi-modal fusion, together with training methods for it, that makes it possible to effectively capture the reliability of the clues and give more weight to the more reliable ones. Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data. Moreover, we also record an audio-visual dataset of simultaneous speech with realistic visual clue corruption and show that audio-visual target speaker extraction with our proposals works successfully on real data.
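The abstract does not spell out the attention formulation, so the following is only a minimal sketch of the general idea of attention-based clue fusion, assuming a standard additive attention in which the mixture features act as the query and each modality's clue embedding acts as a key and value. The function name `attention_fusion`, the parameters `w_query`, `w_key` and `v`, and all dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def attention_fusion(mixture_feat, clue_embs, w_query, w_key, v):
    """Fuse per-modality clue embeddings with frame-wise attention weights.

    Hypothetical sketch (additive attention), not the paper's exact method.

    mixture_feat: (T, D_m) features of the speech mixture, used as the query
    clue_embs:    (M, T, D_c) one clue embedding per modality and frame
    w_query:      (D_m, H) query projection
    w_key:        (D_c, H) key projection
    v:            (H,) scoring vector
    Returns the fused clue (T, D_c) and the attention weights (T, M).
    """
    query = mixture_feat @ w_query                   # (T, H)
    keys = clue_embs @ w_key                         # (M, T, H)
    scores = np.tanh(keys + query[None, :, :]) @ v   # (M, T)
    weights = softmax(scores.T, axis=-1)             # (T, M), rows sum to 1
    # Weighted sum over modalities for every frame.
    fused = np.einsum("tm,mtd->td", weights, clue_embs)
    return fused, weights


# Usage with random placeholder inputs: M = 2 modalities (audio, visual).
rng = np.random.default_rng(0)
T, D_m, D_c, H = 5, 8, 6, 4
mixture = rng.standard_normal((T, D_m))
clues = rng.standard_normal((2, T, D_c))
fused, weights = attention_fusion(
    mixture,
    clues,
    rng.standard_normal((D_m, H)),
    rng.standard_normal((D_c, H)),
    rng.standard_normal(H),
)
print(fused.shape, weights.shape)
```

In a trained system the projections would be learned, so that a corrupted clue (for example, an occluded face) would ideally receive a lower weight in the affected frames; here the parameters are random and the weights carry no meaning.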

Bibliographic details

Source: Spoken Language Technology Workshop, 2021, pp. 778-784 (7 pages)
Authors: Hiroshi Sato; Tsubasa Ochiai; Keisuke Kinoshita; Marc Delcroix; Tomohiro Nakatani; Shoko Araki
Original format: PDF
Keywords: Training; Measurement; Visualization; Speech recognition; Reliability; Proposals; Data mining
Date indexed: 2022-08-26 13:52:51

Similar literature

1. Research on Motion Attention Fusion Model-Based Video Target Detection and Extraction of Global Motion Scene [J]. Long Liu, Boyang Fan, Jing Zhao. Journal of Signal and Information Processing. 2013, No. 3.
2. Multimodal Multi-Channel On-Line Speaker Diarization Using Sensor Fusion Through SVM [J]. Peruffo Minotto Vicente, Rosito Jung Claudio, Lee Bowon. Multimedia, IEEE Transactions on. 2015, No. 10.
3. SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures [J]. Zmolikova Katerina, Delcroix Marc, Kinoshita Keisuke. Selected Topics in Signal Processing, IEEE Journal of. 2019, No. 4.
4. Speaker-Aware Target Speaker Enhancement by Jointly Learning with Speaker Embedding Extraction [C]. Xuan Ji, Meng Yu, Chunlei Zhang. IEEE International Conference on Acoustics, Speech and Signal Processing. 2020.
5. Speech-based Affective Computing Using Attention with Multimodal Fusion [D]. Gu, Yue. 2020.
6. Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model [O]. Rehan Ahmad, Syed Zubair, Hani Alquhayz. 2019.
7. Speaker-adapted neural-network-based fusion for multimodal reference resolution [O]. Diana Kleingarn, Nima Nabizadeh, Martin Heckmann. 2019.
