
Integrating audio and visual information to provide highly robust speech recognition


Abstract

Many human-machine interactions require accurate automatic speech recognition in the presence of high levels of interfering noise. The paper shows that improvements in recognition accuracy can be obtained by including data derived from a speaker's lip images. We describe the combination of the audio and visual data in the construction of composite feature vectors, and a hidden Markov model structure which allows for asynchrony between the audio and visual components. These ideas are applied to a speaker-dependent recognition task involving a small vocabulary and subject to interfering noise. The recognition results obtained using composite vectors and cross-product models are compared with those based on an audio-only feature vector. The benefit of this approach is shown to be increased performance over a very wide range of noise levels.
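The composite feature vectors described above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes hypothetical per-frame audio features (e.g. cepstral coefficients) and lip-shape parameters extracted at a lower video frame rate, upsamples the visual stream to the audio frame rate by linear interpolation, and concatenates the two per frame:

```python
import numpy as np

def composite_features(audio_feats, visual_feats):
    """Upsample visual features to the audio frame rate, then
    concatenate audio and visual features frame by frame.

    audio_feats:  (T_audio, D_a) array, one row per audio frame
    visual_feats: (T_video, D_v) array, one row per video frame
    """
    T = audio_feats.shape[0]
    # Map both streams onto a common normalized time axis [0, 1]
    # and linearly interpolate each visual dimension onto the
    # T audio-frame positions.
    idx_src = np.linspace(0.0, 1.0, visual_feats.shape[0])
    idx_dst = np.linspace(0.0, 1.0, T)
    visual_up = np.stack(
        [np.interp(idx_dst, idx_src, visual_feats[:, d])
         for d in range(visual_feats.shape[1])],
        axis=1,
    )
    # Composite vector: audio and visual components side by side.
    return np.hstack([audio_feats, visual_up])

# Illustrative sizes only: 100 audio frames of 12 coefficients,
# 25 video frames of 6 lip-shape parameters.
audio = np.random.randn(100, 12)
visual = np.random.randn(25, 6)
comp = composite_features(audio, visual)
print(comp.shape)  # (100, 18)
```

Frame-level concatenation like this forces the two streams into lockstep; the cross-product HMM structure mentioned in the abstract instead models each state as a pair of audio and visual states, which is what allows the two components to be asynchronous.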

