IEEE International Conference on Acoustics, Speech and Signal Processing

Multimodal Speaker Adaptation of Acoustic Model and Language Model for ASR Using Speaker Face Embedding

Abstract

We present an investigation into the adaptation of the acoustic model and the language model for automatic speech recognition (ASR) using speaker face information for the transcription of a multimedia dataset. We begin by reviewing relevant previous work on the integration of visual signals into ASR systems. Our experimental investigation shows a small improvement in word error rate (WER) when transcribing a collection of instruction videos by adapting the acoustic model and the language model with fixed-length face embedding vectors. We also present potential approaches to integrating human facial information and body gestures into ASR as further directions for research on this topic.
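The abstract does not specify the adaptation architecture, but a common way to condition an acoustic model on a fixed-length speaker embedding is to broadcast the vector across time and concatenate it to every acoustic frame. The PyTorch sketch below illustrates that general scheme; the class name, feature and embedding dimensions, and the CTC-style output layer are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class FaceConditionedAM(nn.Module):
    # Toy acoustic model: per-frame filterbank features are
    # concatenated with a fixed-length speaker face embedding
    # before the recurrent encoder.
    def __init__(self, feat_dim=80, face_dim=128, hidden=256, n_tokens=500):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim + face_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_tokens)

    def forward(self, feats, face_emb):
        # feats: (batch, time, feat_dim); face_emb: (batch, face_dim)
        # Broadcast the utterance-level face embedding across all frames.
        face = face_emb.unsqueeze(1).expand(-1, feats.size(1), -1)
        x = torch.cat([feats, face], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)  # per-frame token logits (e.g. for a CTC loss)

Language-model adaptation could condition on the same embedding analogously, for example by concatenating it to the token embeddings of a neural LM; the abstract reports only that adapting both models in this fashion yielded a small WER improvement.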
