
A multimodal sensor fusion architecture for audio-visual speech recognition.



Abstract

A key requirement for developing any innovative system in a computing environment is a sufficiently friendly interface for the average end user. Accurate design of such a user-centered interface, however, involves more than the ergonomics of panels and displays: designers must also precisely define what information to use and how, where, and when to use it. Recent advances in user-centered design of computing systems suggest that multimodal integration can provide different types and levels of intelligence to the user interface. This thesis aims at improving speech-recognition-based interfaces by exploiting the visual modality conveyed by the movements of the lips.

Designing a good visual front end is a major part of this framework. To this end, this work derives optical flow fields for consecutive frames of people speaking. Independent Component Analysis (ICA) is then used to derive basis flow fields, and the coefficients of these basis fields constitute the visual features of interest. It is shown that using ICA on optical flow fields yields better classification results than the traditional approaches based on Principal Component Analysis (PCA): ICA can capture the higher-order statistics needed to model the motion of the mouth. Lip movement is complex in nature, as it involves large image velocities, self-occlusion (due to the appearance and disappearance of the teeth), and substantial non-rigidity.

Another issue of great interest to designers of audio-visual speech recognition systems is the integration (fusion) of the audio and visual information in an automatic speech recognizer. For this purpose, a reliability-driven sensor fusion scheme is developed, together with a statistical approach that accounts for dynamic changes in reliability. This is done in two steps.
The first step derives suitable statistical reliability measures for the individual information streams; these measures are based on the dispersion of the N-best hypotheses of the individual stream classifiers. The second step finds an optimal mapping between the reliability measures and the stream weights that maximizes the conditional likelihood; genetic algorithms are used for this purpose.

The addressed issues are challenging problems, and solving them is essential for developing an audio-visual speech recognition framework that maximizes the information gathered about the words uttered while minimizing the impact of noise.
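The visual front end described above — deriving basis flow fields with ICA and using their coefficients as features — can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the flow fields here are random stand-in data rather than real optical flow from mouth-region frames, the dimensions are invented, and scikit-learn's `FastICA` is assumed as the ICA solver.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Stand-in data: 200 "optical flow fields", each flattened into a
# 128-dim vector. In the thesis these would come from the optical
# flow between consecutive frames of a speaker's mouth region.
flows = rng.standard_normal((200, 128))

# ICA derives basis flow fields; the per-frame mixing coefficients
# are the visual features fed to the stream classifier.
ica = FastICA(n_components=10, random_state=0)
features = ica.fit_transform(flows)   # shape (200, 10): ICA coefficients
basis_fields = ica.components_        # shape (10, 128): basis flow fields
```

Replacing `FastICA` with PCA (`sklearn.decomposition.PCA`) recovers the baseline the thesis compares against; PCA captures only second-order statistics, which is the motivation for preferring ICA here.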
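The two-step fusion scheme can likewise be sketched. The thesis learns the mapping from reliability measures to stream weights with genetic algorithms; the simple normalized-dispersion weighting below is only an illustrative stand-in for that learned mapping, and the function names and toy scores are invented for the example.

```python
import numpy as np

def nbest_dispersion(log_likelihoods, n=4):
    """Reliability proxy: mean gap between the best hypothesis and the
    rest of the N-best list. A peaked list (large gaps) suggests a
    confident stream; a flat list suggests an ambiguous, noisy one."""
    top = np.sort(np.asarray(log_likelihoods))[::-1][:n]
    return float(np.mean(top[0] - top[1:]))

def fuse(audio_ll, visual_ll, n=4):
    """Combine per-word stream log-likelihoods with weights driven by
    each stream's N-best dispersion (weights sum to 1). The thesis
    optimizes this reliability-to-weight mapping with a GA instead."""
    r_a = nbest_dispersion(audio_ll, n)
    r_v = nbest_dispersion(visual_ll, n)
    w_a = r_a / (r_a + r_v)
    return w_a * audio_ll + (1.0 - w_a) * visual_ll

# Toy scores over a 5-word vocabulary: the audio stream is nearly
# flat (noisy), while the visual stream clearly prefers word 2.
audio_ll = np.array([-10.0, -10.2, -10.1, -10.3, -10.25])
visual_ll = np.array([-12.0, -12.5, -8.0, -13.0, -12.8])
print(int(np.argmax(fuse(audio_ll, visual_ll))))  # 2
```

Because the audio N-best list is flat, its dispersion (and thus its weight) is small, and the confident visual stream dominates the fused score — the behavior the reliability-driven scheme is designed to produce under acoustic noise.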

Bibliographic Details

  • Author

    Makkook, Mustapha A.

  • Affiliation

    University of Waterloo (Canada).

  • Degree Grantor: University of Waterloo (Canada).
  • Subject: Engineering, Electronics and Electrical.
  • Degree: M.A.Sc.
  • Year: 2007
  • Pages: 109 p.
  • Total Pages: 109
  • Format: PDF
  • Language: English
  • CLC Classification: Radio electronics and telecommunications technology
  • Keywords
