
Audio-Visual Asynchrony Modeling and Analysis for Speech Alignment and Recognition.



Abstract

This work investigates perceived audio-visual asynchrony, specifically anticipatory coarticulation, in which the visual cues of a speech sound (e.g. lip rounding) may occur before the acoustic cues. This phenomenon often gives the impression that the visual and acoustic signals are asynchronous. The effect can be accounted for using models based on multiple hidden Markov models with synchrony constraints linking states in the different modalities, though generally only within phones and not across phone boundaries. In this work, we consider several such models, implemented as dynamic Bayesian networks (DBNs). We study the models' ability to accurately locate phone and viseme (the audio and video sub-word units, respectively) boundaries in the audio and video signals, and compare the resulting alignments with human labels of these boundaries. This alignment task is important in its own right, as it can serve as both an analysis tool and a convenience tool for linguists. Furthermore, advances in alignment systems can carry over into the speech recognition domain.

This thesis makes several contributions. First, it presents a new set of manually labeled phonetic boundary data for words expected to display asynchrony; analysis of the data confirms our expectations about this phenomenon. Second, it presents a new software program, AVDDisplay, which allows audio, video, and alignment data to be viewed simultaneously and in sync; this tool is essential to the alignment analysis detailed in this work. Third, new DBN-based models of audio-visual asynchrony are presented; the newly proposed models incorporate linguistic context into the asynchrony model. Fourth, alignment experiments compare system performance against the hand-labeled ground truth. Finally, the performance of these models in a speech recognition context is examined.

This work finds that the newly proposed models outperform previously proposed asynchrony models on both the alignment and recognition tasks.
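To make the "synchrony constraints linking states in different modalities" concrete, the idea can be sketched as a joint Viterbi alignment over two observation streams, where the audio and video streams each traverse the same sequence of sub-word units but may be offset from one another by at most a fixed number of units. This is an illustrative sketch only, not the thesis's actual DBN models: the function name `asynchrony_viterbi`, the per-frame log-likelihood inputs, and the simple stay-or-advance topology are all assumptions made for the example.

```python
import numpy as np

def asynchrony_viterbi(log_a, log_v, max_async=1):
    """Jointly align two streams over the same N sub-word units.

    Illustrative sketch (not the thesis's model): each stream either
    stays in its current unit or advances by one per frame, and the two
    streams may differ by at most `max_async` units at any time.

    log_a, log_v: (T, N) arrays of per-frame log-likelihoods for the
    audio and video streams under each of the N units.
    Returns the best joint path as a list of (audio_unit, video_unit).
    """
    T, N = log_a.shape
    NEG = -np.inf
    # dp[i, j]: best log-prob with audio in unit i, video in unit j
    dp = np.full((N, N), NEG)
    back = np.zeros((T, N, N, 2), dtype=int)
    dp[0, 0] = log_a[0, 0] + log_v[0, 0]  # both streams start in unit 0
    for t in range(1, T):
        new = np.full((N, N), NEG)
        for i in range(N):
            for j in range(N):
                if abs(i - j) > max_async:
                    continue  # prune joint states violating the constraint
                best, arg = NEG, (i, j)
                # each stream either stays put or advances by one unit
                for pi in (i - 1, i):
                    for pj in (j - 1, j):
                        if pi < 0 or pj < 0 or abs(pi - pj) > max_async:
                            continue
                        if dp[pi, pj] > best:
                            best, arg = dp[pi, pj], (pi, pj)
                if best > NEG:
                    new[i, j] = best + log_a[t, i] + log_v[t, j]
                    back[t, i, j] = arg
        dp = new
    # both streams must finish in the final unit; trace back the path
    path = [(N - 1, N - 1)]
    for t in range(T - 1, 0, -1):
        i, j = path[-1]
        path.append(tuple(back[t, i, j]))
    return path[::-1]
```

With anticipatory coarticulation, the video stream can enter the next unit a frame or two before the audio stream does; a synchronous model (`max_async=0`) would be forced to place both boundaries at the same frame, while this constrained joint search lets the streams drift apart within the allowed bound and recovers the earlier visual boundary.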

Bibliographic record

  • Author: Terry, Louis
  • Affiliation: Northwestern University
  • Degree-granting institution: Northwestern University
  • Subjects: Speech Communication; Engineering, Electronics and Electrical
  • Degree: Ph.D.
  • Year: 2011
  • Pagination: 153 p.
  • Total pages: 153
  • Format: PDF
  • Language: eng
