Image and Vision Computing

Robust face-voice based speaker identity verification using multilevel fusion



Abstract

In this paper, we propose a robust multilevel fusion strategy involving cascaded multimodal fusion of audio-lip-face motion, correlation and depth features for biometric person authentication. The proposed approach combines the information from different audio-video based modules, namely the audio-lip motion module, the audio-lip correlation module, and the 2D + 3D motion-depth fusion module, and performs a hybrid cascaded fusion in an automatic, unsupervised and adaptive manner, adapting to the local performance of each module. This is done by taking the output-score based reliability estimates (confidence measures) of each module into account. The module weightings are determined automatically such that the reliability measure of the combined scores is maximised. To test the robustness of the proposed approach, the audio and visual speech (mouth) modalities are degraded to emulate various levels of train/test mismatch, employing additive white Gaussian noise for the audio and JPEG compression for the video signals. The results show improved fusion performance over a range of tested levels of audio and video degradation, compared to the individual module performances. Experiments on the 3D stereovision database AVOZES show that, at severe levels of audio and video mismatch, the audio, mouth, 3D face, and tri-module (audio-lip motion, correlation and depth) fusion EERs were 42.9%, 32%, 15%, and 7.3%, respectively, for the biometric person authentication task.
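The adaptive weighting idea can be illustrated with a minimal sketch: each module's match score is weighted in proportion to its output-score reliability estimate, so that modules degraded by the current noise conditions contribute less to the fused score. The function name, scores, and confidence values below are hypothetical, and this simple proportional weighting stands in for the paper's reliability-maximising weight determination.

```python
import numpy as np

def fuse_scores(module_scores, reliabilities):
    """Weighted-sum score fusion: weights are proportional to each
    module's reliability estimate (confidence measure) and are
    normalised to sum to one."""
    w = np.asarray(reliabilities, dtype=float)
    w = w / w.sum()  # adaptive, per-trial weights
    return float(np.dot(w, np.asarray(module_scores, dtype=float)))

# Hypothetical per-trial match scores from the three modules
# (audio-lip motion, audio-lip correlation, 2D + 3D motion-depth):
scores = [0.62, 0.48, 0.91]
conf = [0.2, 0.3, 0.5]  # reliability estimates under current conditions
fused = fuse_scores(scores, conf)  # ≈ 0.723
```

Because the weights are recomputed per trial from the confidence measures, a module whose signal is heavily degraded (e.g. noisy audio) is automatically down-weighted without any supervision.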
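The reported equal error rates (EERs) are the operating points at which the false-accept rate (FAR) equals the false-reject rate (FRR). A minimal sketch of estimating the EER from genuine and impostor score sets follows; the score values are illustrative, not from the AVOZES experiments.

```python
import numpy as np

def eer(genuine, impostor):
    """Estimate the equal error rate by sweeping the decision
    threshold over all observed scores and taking the point where
    the false-accept rate (FAR) and false-reject rate (FRR) meet."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))  # closest FAR/FRR crossing
    return (far[i] + frr[i]) / 2

# Illustrative overlapping genuine/impostor score distributions:
print(eer([0.8, 0.7, 0.55, 0.4], [0.6, 0.5, 0.3, 0.2]))  # 0.25
```

A lower EER means better verification: the tri-module fusion EER of 7.3% under severe mismatch indicates that the combined system rejects far fewer genuine claims and accepts far fewer impostors than any single modality.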
