Source: Image Processing, 2005. ICIP 2005. IEEE International Conference on

Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance


Abstract

In this paper, we describe an audio-visual automatic speech recognition (AV-ASR) system that uses facial animation parameters (FAPs), supported by the MPEG-4 standard, for the visual representation of speech. We describe the visual feature extraction algorithms used to extract the FAPs that control outer- and inner-lip movement. Principal component analysis (PCA) is performed on both the inner- and outer-lip FAP vectors to reduce their dimensionality and decorrelate them. The PCA-based projection weights of the extracted FAP vectors are used as visual features. Multi-stream hidden Markov models (HMMs) and a late integration approach are used to integrate audio and visual information and to train a continuous AV-ASR system. We compare the performance of the developed AV-ASR system using outer- and inner-lip FAPs, individually and jointly. Experiments were performed for different dimensionalities of the visual features, at various SNRs (0-30 dB) with additive white Gaussian noise, on a database with a relatively large vocabulary (approximately 1000 words). The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only ASR WERs. Conclusions are drawn on the individual and combined effectiveness of the inner- and outer-lip FAPs, and on the trade-off between the dimensionality of the visual features and the amount of speechreading information they contain, together with its influence on AV-ASR performance.
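The abstract does not give the exact integration formula, so the following is only a minimal sketch of one common way to combine streams in a multi-stream HMM: a weighted sum of per-stream log-likelihoods, with weights that sum to one. It also shows how a relative WER reduction is computed. The function names, the candidate words, and all scores and error rates are hypothetical illustrations, not values from the paper.

```python
def combine_stream_log_likelihoods(audio_loglik, visual_loglik, audio_weight):
    """Weighted sum of per-stream log-likelihoods (assumed formulation).

    audio_weight is in [0, 1]; the visual stream receives 1 - audio_weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_loglik + (1.0 - audio_weight) * visual_loglik


def pick_word(candidates, audio_weight):
    """Return the candidate word with the highest combined score.

    candidates maps word -> (audio_loglik, visual_loglik).
    """
    return max(
        candidates,
        key=lambda word: combine_stream_log_likelihoods(
            candidates[word][0], candidates[word][1], audio_weight
        ),
    )


def relative_wer_reduction(wer_audio_only, wer_audio_visual):
    """Relative reduction of the audio-visual WER with respect to audio-only."""
    if wer_audio_only <= 0.0:
        raise ValueError("wer_audio_only must be positive")
    return (wer_audio_only - wer_audio_visual) / wer_audio_only


# Hypothetical scores: the audio stream slightly prefers "pat",
# while the visual (lip) stream clearly prefers "bat".
candidates = {"bat": (-12.0, -9.0), "pat": (-11.5, -14.0)}

audio_only_choice = pick_word(candidates, audio_weight=1.0)
audio_visual_choice = pick_word(candidates, audio_weight=0.5)

# Hypothetical error rates, chosen only to illustrate the arithmetic.
reduction = relative_wer_reduction(0.50, 0.40)
```

In practice the audio weight would typically be lowered as the SNR drops, so that the visual stream contributes more when the audio is noisy. How the weights were actually chosen in the paper is not stated in the abstract.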
