International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA 2005); 20–22 July 2005; Hilton Rye Town, NY (US)

Audio-Visual Speaker Identification via Adaptive Fusion Using Reliability Estimates of Both Modalities



Abstract

An audio-visual speaker identification system is described in which the audio and visual speech modalities are fused by an automatic, unsupervised process that adapts to local classifier performance by taking into account output-score-based reliability estimates of both modalities. Previously reported methods do not consider that both the audio and the visual modalities can be degraded. The visual modality uses the speaker's lip information. To test the robustness of the system, the audio and visual modalities are degraded to emulate various levels of train/test mismatch, employing additive white Gaussian noise for the audio signals and JPEG compression for the visual signals. Experiments are carried out on a large augmented data set from the XM2VTS database. The results show improved audio-visual accuracies at all tested levels of audio and visual degradation, compared to the individual audio or visual modality accuracies. For high mismatch levels, the audio, visual, and auto-adapted audio-visual accuracies are 37.1%, 48%, and 71.4%, respectively.
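The adaptive fusion described in the abstract can be illustrated with a minimal sketch. The reliability estimator below (the margin between the best and second-best class scores) is one common score-based heuristic and an assumption for illustration; the paper's exact estimator, feature extraction, and classifiers are not reproduced here.

```python
def reliability(scores):
    # Margin between the best and second-best class scores,
    # used as a score-based confidence proxy (a common heuristic;
    # an assumption here, not the paper's exact estimator).
    top2 = sorted(scores, reverse=True)[:2]
    return top2[0] - top2[1]

def fuse(audio_scores, visual_scores):
    # Normalize each modality's scores to a distribution over
    # speakers, then weight the modalities by relative reliability,
    # so the less degraded modality dominates the fused decision.
    a = [s / sum(audio_scores) for s in audio_scores]
    v = [s / sum(visual_scores) for s in visual_scores]
    ra, rv = reliability(a), reliability(v)
    wa = ra / (ra + rv) if (ra + rv) > 0 else 0.5
    return [wa * ai + (1 - wa) * vi for ai, vi in zip(a, v)]

# Toy scores: degraded audio (small margin) vs. clean visual (large margin).
audio = [0.2, 0.5, 0.3]
visual = [0.05, 0.85, 0.1]
fused = fuse(audio, visual)
best = max(range(len(fused)), key=fused.__getitem__)
print(best)  # index of the identified speaker
```

Because the visual margin is larger in this toy case, the visual modality receives the larger weight, mirroring the paper's idea that fusion should adapt automatically when one modality is degraded.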

