International Conference on Asian Language Processing

Multimodal learning using 3D audio-visual data for audio-visual speech recognition



Abstract

Recently, various audio-visual speech recognition (AVSR) systems have been developed using multimodal learning techniques. One key issue is that most of them are based on 2D audio-visual (AV) corpora with low video sampling rates. To address this issue, this paper introduces a 3D AV data set with a higher video sampling rate (up to 100 Hz). Another issue is the requirement for both auditory and visual modalities during system testing. To address this issue, a visual-feature-generation-based bimodal convolutional neural network (CNN) framework is proposed to build an AVSR system with wider applicability. In this framework, a long short-term memory recurrent neural network (LSTM-RNN) generates the visual modality from the auditory modality, while CNNs integrate the two modalities. On a Mandarin Chinese far-field speech recognition task, when the visual modality is provided, a significant average character error rate (CER) reduction of about 27% relative was obtained over the audio-only CNN baseline. When the visual modality is unavailable, the proposed AVSR system using the visual feature generation technique outperformed the audio-only CNN baseline by 18.52% relative CER.
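The inference-time logic described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' code: `audio_to_visual` stands in for the trained LSTM-RNN generator and `fuse` for the bimodal CNN; both are replaced here by trivial per-frame operations so the control flow (generate the visual stream only when it is missing) is runnable.

```python
def audio_to_visual(audio_feats):
    """Placeholder for the LSTM-RNN mapping audio frames to visual features.
    A trivial per-frame transform stands in for a trained network."""
    return [[0.5 * x for x in frame] for frame in audio_feats]

def fuse(audio_feats, visual_feats):
    """Placeholder for the bimodal CNN: per-frame feature concatenation."""
    return [a + v for a, v in zip(audio_feats, visual_feats)]

def avsr_features(audio_feats, visual_feats=None):
    """If the visual stream is unavailable at test time, generate it
    from the audio stream before fusing the two modalities."""
    if visual_feats is None:
        visual_feats = audio_to_visual(audio_feats)
    return fuse(audio_feats, visual_feats)

audio = [[1.0, 2.0], [3.0, 4.0]]   # 2 frames of 2-dim audio features
print(avsr_features(audio))        # visual stream generated from audio
```

The key design point is that the same fusion network serves both test conditions: real visual features when the camera stream exists, generated ones otherwise, which is what allows the audio-only deployment to still beat the audio-only baseline.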
