International Conference on Asian Language Processing

Multimodal learning using 3D audio-visual data for audio-visual speech recognition



Abstract

Recently, various audio-visual speech recognition (AVSR) systems have been developed using multimodal learning techniques. One key issue is that most of them are based on 2D audio-visual (AV) corpora with low video sampling rates. To address this issue, this paper introduces a 3D AV data set with a higher video sampling rate (up to 100 Hz). Another issue is the requirement for both auditory and visual modalities during system testing. To address this issue, a visual-feature-generation-based bimodal convolutional neural network (CNN) framework is proposed to build an AVSR system with wider applicability. In this framework, a long short-term memory recurrent neural network (LSTM-RNN) generates the visual modality from the auditory modality, while CNNs integrate the two modalities. On a Mandarin Chinese far-field speech recognition task, when the visual modality is provided, a significant average character error rate (CER) reduction of about 27% relative was obtained over the audio-only CNN baseline. When the visual modality is unavailable, the proposed AVSR system using the visual feature generation technique outperformed the audio-only CNN baseline by 18.52% relative CER.
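The inference-time logic described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' code: `audio_to_visual` stands in for the trained LSTM-RNN generator and `fuse` for the bimodal CNN; both are replaced here by trivial per-frame operations so the control flow (generate the visual stream only when it is missing) is runnable.

```python
def audio_to_visual(audio_feats):
    """Placeholder for the LSTM-RNN mapping audio frames to visual features.
    A trivial per-frame transform stands in for a trained network."""
    return [[0.5 * x for x in frame] for frame in audio_feats]

def fuse(audio_feats, visual_feats):
    """Placeholder for the bimodal CNN: per-frame feature concatenation."""
    return [a + v for a, v in zip(audio_feats, visual_feats)]

def avsr_features(audio_feats, visual_feats=None):
    """If the visual stream is unavailable at test time, generate it
    from the audio stream before fusing the two modalities."""
    if visual_feats is None:
        visual_feats = audio_to_visual(audio_feats)
    return fuse(audio_feats, visual_feats)

audio = [[1.0, 2.0], [3.0, 4.0]]   # 2 frames of 2-dim audio features
print(avsr_features(audio))        # visual stream generated from audio
```

The key design point is that the same fusion network serves both test conditions: real visual features when the camera stream exists, generated ones otherwise, which is what allows the audio-only deployment to still beat the audio-only baseline.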
