IEEE/ACM Transactions on Audio, Speech, and Language Processing

Learning Dynamic Stream Weights For Coupled-HMM-Based Audio-Visual Speech Recognition



Abstract

With the increasing use of multimedia data in communication technologies, the idea of employing visual information in automatic speech recognition (ASR) has recently gathered momentum. In conjunction with the acoustical information, the visual data enhances the recognition performance and improves the robustness of ASR systems in noisy and reverberant environments. In audio-visual systems, dynamic weighting of audio and video streams according to their instantaneous confidence is essential for reliably and systematically achieving high performance. In this paper, we present a complete framework that allows blind estimation of dynamic stream weights for audio-visual speech recognition based on coupled hidden Markov models (CHMMs). As a stream weight estimator, we consider using multilayer perceptrons and logistic functions to map multidimensional reliability measure features to audiovisual stream weights. Training the parameters of the stream weight estimator requires numerous input-output tuples of reliability measure features and their corresponding stream weights. We estimate these stream weights based on oracle knowledge using an expectation maximization algorithm. We define 31-dimensional feature vectors that combine model-based and signal-based reliability measures as inputs to the stream weight estimator. During decoding, the trained stream weight estimator is used to blindly estimate stream weights. The entire framework is evaluated using the Grid audio-visual corpus and compared to state-of-the-art stream weight estimation strategies. The proposed framework significantly enhances the performance of the audio-visual ASR system in all examined test conditions.
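The abstract describes the fusion mechanism only at a high level. As background, a common formulation of stream-weighted likelihood combination in coupled-HMM audio-visual ASR (stated here as context, not quoted from the paper) raises the audio and video observation likelihoods to a frame-dependent weight \lambda_t and its complement:

p(\mathbf{o}_t \mid q_t) = p(\mathbf{o}^{A}_t \mid q^{A}_t)^{\lambda_t} \cdot p(\mathbf{o}^{V}_t \mid q^{V}_t)^{1-\lambda_t}, \qquad 0 \le \lambda_t \le 1.

A minimal sketch of the logistic stream weight estimator mentioned in the abstract is given below; the feature vector x_t stands for the 31-dimensional reliability measures, while the weight vector w and bias b are illustrative placeholders rather than the authors' trained parameters.

import numpy as np

def logistic_stream_weight(x_t, w, b):
    # Map a reliability feature vector (e.g., 31-dim) to a stream weight in (0, 1).
    # In the paper's framework, such parameters would be trained on oracle stream
    # weights obtained via expectation maximization; here w and b are placeholders.
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x_t) + b)))

def fuse_log_likelihoods(log_p_audio, log_p_video, lam):
    # Frame-wise log-domain fusion of audio and video likelihoods
    # with dynamic stream weight lam.
    return lam * log_p_audio + (1.0 - lam) * log_p_video

During decoding, lam would be recomputed at every frame from the current reliability features, so the fusion adapts to momentary acoustic or visual degradation.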
