IEEE Transactions on Audio, Speech, and Language Processing

Video-Aided Model-Based Source Separation in Real Reverberant Rooms


Abstract

Source separation algorithms that utilize only audio data can perform poorly if multiple sources or reverberation are present. In this paper, we therefore propose a video-aided model-based source separation algorithm for two-channel reverberant recordings in which the sources are assumed to be static. By exploiting cues from video, we first localize the individual speech sources in the enclosure and then estimate their directions. The interaural spatial cues, namely the interaural phase difference and the interaural level difference, as well as the mixing vectors, are modeled probabilistically. The models make use of the source direction information and are evaluated at discrete time-frequency points. The model parameters are refined with the well-known expectation-maximization (EM) algorithm. The algorithm outputs time-frequency masks that are used to reconstruct the individual sources. Simulation results show that, by utilizing the visual modality, the proposed algorithm produces better time-frequency masks and thereby improved source estimates. We report experimental results for the proposed algorithm in different scenarios, compare it with other audio-only and audio-visual algorithms, and achieve improved performance on both synthetic and real data. We also include dereverberation-based pre-processing in our algorithm to suppress the late reverberant components in the observed stereo mixture and further enhance its overall output. This makes our algorithm a suitable candidate for under-determined, highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
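To make the masking step concrete, the following is a minimal sketch, not the authors' implementation: it extracts the interaural phase difference (IPD) from a stereo STFT and runs a simplified EM over a Gaussian mixture on the wrapped phase residual to obtain soft time-frequency masks. The function name, parameters (n_fft, n_sources, n_iter), and the initialization are illustrative; in the paper, the source directions estimated from video would initialize the model, and the full probabilistic model also incorporates the interaural level difference and the mixing vectors, with frequency-dependent cue models.

```python
import numpy as np
from scipy.signal import stft, istft


def separate_two_sources(left, right, fs, n_fft=1024, n_sources=2, n_iter=30):
    """Soft time-frequency masking from interaural phase differences via EM.

    A simplified, illustrative stand-in for the model-based approach
    described in the abstract; clusters on the IPD alone for brevity.
    """
    # Stereo STFTs: one complex spectrogram per channel.
    _, _, L = stft(left, fs, nperseg=n_fft)
    _, _, R = stft(right, fs, nperseg=n_fft)

    # Interaural phase difference at each time-frequency point.
    ipd = np.angle(L * np.conj(R))
    x = ipd.ravel()

    # Initial per-source means: in the paper these would come from the
    # video-based localization of each speaker; here they are placeholders.
    mu = np.linspace(-1.0, 1.0, n_sources)
    var = np.ones(n_sources)
    pi = np.full(n_sources, 1.0 / n_sources)
    eps = 1e-12

    for _ in range(n_iter):
        # E-step: posterior responsibility of each source at each T-F point,
        # using a Gaussian on the wrapped phase residual.
        diff = np.angle(np.exp(1j * (x[None, :] - mu[:, None])))
        logp = (np.log(pi[:, None])
                - 0.5 * diff ** 2 / var[:, None]
                - 0.5 * np.log(2 * np.pi * var[:, None]))
        logp -= logp.max(axis=0, keepdims=True)  # numerical stability
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=0, keepdims=True)

        # M-step: re-estimate mixture weights, circular means, and variances.
        nk = gamma.sum(axis=1) + eps
        pi = nk / nk.sum()
        mu = np.angle(np.exp(1j * (mu + (gamma * diff).sum(axis=1) / nk)))
        var = (gamma * diff ** 2).sum(axis=1) / nk + eps

    # The responsibilities, reshaped to the spectrogram grid, act as soft
    # time-frequency masks; apply them to one channel and invert.
    masks = gamma.reshape(n_sources, *ipd.shape)
    return [istft(masks[k] * L, fs, nperseg=n_fft)[1] for k in range(n_sources)]
```

A full implementation along the lines of the abstract would additionally model the frequency dependence of the IPD, add ILD and mixing-vector likelihoods to the E-step, and apply dereverberation pre-processing to the stereo mixture before computing the spatial cues.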