Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance

Abstract

In this paper, we describe an audio-visual automatic speech recognition (AV-ASR) system that utilizes facial animation parameters (FAPs), supported by the MPEG-4 standard, for the visual representation of speech. We describe the visual feature extraction algorithms used for extracting FAPs, which control outer- and inner-lip movement. Principal component analysis (PCA) is performed on both the inner- and outer-lip FAP vectors in order to decrease their dimensionality and decorrelate them. The PCA-based projection weights of the extracted FAP vectors are used as visual features. Multi-stream hidden Markov models (HMMs) and a late integration approach are used to integrate audio and visual information and train a continuous AV-ASR system. We compare the performance of the developed AV-ASR system utilizing outer- and inner-lip FAPs, individually and jointly. Experiments were performed for different dimensionalities of the visual features, at various SNRs (0-30 dB) with additive white Gaussian noise, on a relatively large vocabulary (approximately 1000 words) database. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only ASR WERs. Conclusions are drawn on the individual and combined effectiveness of the inner- and outer-lip FAPs, and on the trade-off between the dimensionality of the visual features and the amount of speechreading information they contain, including its influence on AV-ASR performance.
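The PCA step and the relative WER figure in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the FAP data are synthetic, the frame count and vector length are hypothetical, and the function names (`pca_project`, `combine_stream_scores`, `relative_wer_reduction`) are invented here. The stream-combination function shows one common multi-stream formulation, a weighted sum of per-stream log-likelihoods, which the abstract does not spell out.

```python
import numpy as np


def pca_project(fap_vectors, n_components):
    """Project FAP vectors (one row per video frame) onto the top principal components."""
    mean = fap_vectors.mean(axis=0)
    centered = fap_vectors - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    # The projection weights serve as lower-dimensional, decorrelated visual features.
    return centered @ basis.T, mean, basis


def combine_stream_scores(audio_loglik, visual_loglik, audio_weight):
    """Weighted sum of per-stream log-likelihoods (assumed form; weights sum to 1)."""
    return audio_weight * audio_loglik + (1.0 - audio_weight) * visual_loglik


def relative_wer_reduction(wer_audio_only, wer_audio_visual):
    """Relative WER reduction of the audio-visual system over audio-only ASR."""
    return (wer_audio_only - wer_audio_visual) / wer_audio_only


# Synthetic stand-in for extracted FAPs: 200 frames, 10 values per frame (hypothetical sizes).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
faps = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

features, mean, basis = pca_project(faps, n_components=3)
feature_cov = np.cov(features, rowvar=False)
```

In this sketch the off-diagonal entries of `feature_cov` are numerically zero, which is the decorrelation the abstract refers to. Choosing `n_components` is the dimensionality trade-off the paper studies: fewer components give more compact features but may discard speechreading information.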

Bibliographic record

Source: Image Processing, 2005. ICIP 2005. IEEE International Conference on, pp. 1-4 (4 pages)
Authors: Aleksic P.S.; Katsaggelos A.K.
Original format: PDF

Similar documents

1. Real Time Facial Expression Recognition System with Applications to Facial Animation in MPEG-4 [J]. Naiwala Pathirannehelage Chandrashiri, Takeshi Naemura, Hiroshi Harashima. IEICE Transactions on Information and Systems, 2001, no. 8.
2. Dynamic Facial Expression Analysis and Synthesis With MPEG-4 Facial Animation Parameters [J]. Yongmian Zhang, Qiang Ji, Zhiwei Zhu, et al. IEEE Transactions on Circuits and Systems for Video Technology, 2008, no. 10.
3. Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features [J]. Petar S. Aleksic, Jay J. Williams, Zhilin Wu, et al. EURASIP Journal on Advances in Signal Processing, 2002, no. 11.
4. Comparison of MPEG-4 Facial Animation Parameter Groups with Respect to Audio-Visual Speech Recognition Performance [C]. Petar S. Aleksic, Aggelos K. Katsaggelos. International Conference on Image Processing, 2005.
5. A Facial Animation Model for Expressive Audio-Visual Speech [D]. Somasundaram, Arunachalam. 2006.
6. Comparison of Speech Recognition and Localization Performance in Bilateral and Unilateral Cochlear Implant Users Matched on Duration of Deafness and Age at Implantation [O]. Camille C. Dunn, Richard S. Tyler, Sarah Oakley, et al.
7. Product HMMs for Audio-Visual Continuous Speech Recognition Using Facial Animation Parameters [O]. Petar S. Aleksic et al. 2003.
