
A multimodal sensor fusion architecture for audio-visual speech recognition.



Abstract

A key requirement for developing any innovative system in a computing environment is a sufficiently friendly interface for the average end user. Accurate design of such a user-centered interface, however, involves more than the ergonomics of panels and displays: designers must also precisely define what information to use and how, where, and when to use it. Recent advances in user-centered design of computing systems suggest that multimodal integration can provide different types and levels of intelligence to the user interface. This thesis aims at improving speech-recognition-based interfaces by exploiting the visual modality conveyed by the movements of the lips.

Designing a good visual front end is a major part of this framework. To this end, this work derives optical flow fields for consecutive frames of people speaking. Independent Component Analysis (ICA) is then used to derive basis flow fields, and the coefficients of these basis fields constitute the visual features of interest. It is shown that using ICA on optical flow fields yields better classification results than the traditional approaches based on Principal Component Analysis (PCA): ICA can capture the higher-order statistics needed to model the motion of the mouth. Lip movement is complex in nature, as it involves large image velocities, self-occlusion (due to the appearance and disappearance of the teeth), and substantial non-rigidity.

Another issue of great interest to designers of audio-visual speech recognition systems is the integration (fusion) of the audio and visual information in an automatic speech recognizer. For this purpose, a reliability-driven sensor fusion scheme is developed, together with a statistical approach that accounts for dynamic changes in reliability. This is done in two steps.
The first step derives suitable statistical reliability measures for the individual information streams; these measures are based on the dispersion of the N-best hypotheses of the individual stream classifiers. The second step finds an optimal mapping between the reliability measures and the stream weights that maximizes the conditional likelihood; genetic algorithms are used for this purpose.

The addressed issues are challenging problems, and solving them is essential for developing an audio-visual speech recognition framework that maximizes the information gathered about the words uttered while minimizing the impact of noise.
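The visual front end described above — deriving basis flow fields with ICA and using their coefficients as features — can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the flow fields here are random stand-in data rather than real optical flow from mouth-region frames, the dimensions are invented, and scikit-learn's `FastICA` is assumed as the ICA solver.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Stand-in data: 200 "optical flow fields", each flattened into a
# 128-dim vector. In the thesis these would come from the optical
# flow between consecutive frames of a speaker's mouth region.
flows = rng.standard_normal((200, 128))

# ICA derives basis flow fields; the per-frame mixing coefficients
# are the visual features fed to the stream classifier.
ica = FastICA(n_components=10, random_state=0)
features = ica.fit_transform(flows)   # shape (200, 10): ICA coefficients
basis_fields = ica.components_        # shape (10, 128): basis flow fields
```

Replacing `FastICA` with PCA (`sklearn.decomposition.PCA`) recovers the baseline the thesis compares against; PCA captures only second-order statistics, which is the motivation for preferring ICA here.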
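The two-step fusion scheme can likewise be sketched. The thesis learns the mapping from reliability measures to stream weights with genetic algorithms; the simple normalized-dispersion weighting below is only an illustrative stand-in for that learned mapping, and the function names and toy scores are invented for the example.

```python
import numpy as np

def nbest_dispersion(log_likelihoods, n=4):
    """Reliability proxy: mean gap between the best hypothesis and the
    rest of the N-best list. A peaked list (large gaps) suggests a
    confident stream; a flat list suggests an ambiguous, noisy one."""
    top = np.sort(np.asarray(log_likelihoods))[::-1][:n]
    return float(np.mean(top[0] - top[1:]))

def fuse(audio_ll, visual_ll, n=4):
    """Combine per-word stream log-likelihoods with weights driven by
    each stream's N-best dispersion (weights sum to 1). The thesis
    optimizes this reliability-to-weight mapping with a GA instead."""
    r_a = nbest_dispersion(audio_ll, n)
    r_v = nbest_dispersion(visual_ll, n)
    w_a = r_a / (r_a + r_v)
    return w_a * audio_ll + (1.0 - w_a) * visual_ll

# Toy scores over a 5-word vocabulary: the audio stream is nearly
# flat (noisy), while the visual stream clearly prefers word 2.
audio_ll = np.array([-10.0, -10.2, -10.1, -10.3, -10.25])
visual_ll = np.array([-12.0, -12.5, -8.0, -13.0, -12.8])
print(int(np.argmax(fuse(audio_ll, visual_ll))))  # 2
```

Because the audio N-best list is flat, its dispersion (and thus its weight) is small, and the confident visual stream dominates the fused score — the behavior the reliability-driven scheme is designed to produce under acoustic noise.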

Bibliographic Details

  • Author

    Makkook, Mustapha A.

  • Affiliation

    University of Waterloo (Canada).

  • Degree Grantor: University of Waterloo (Canada).
  • Subject: Engineering, Electronics and Electrical.
  • Degree: M.A.Sc.
  • Year: 2007
  • Pages: 109 p.
  • Total Pages: 109
  • Format: PDF
  • Language: English
  • CLC Classification: Radio electronics and telecommunications technology
  • Keywords
