International Journal of Advanced Computer Research

Noise robust speech recognition system using multimodal audio-visual approach using different deep learning classification techniques

Abstract

Multimodal speech recognition has proved to be one of the most promising approaches to designing a robust speech recognition system, especially when the audio signal is corrupted by noise. The visual signal can provide additional information that improves recognition accuracy in a noisy system, since the reliability of the visual signal is not affected by acoustic noise. The critical stages in designing a robust speech recognition system are the choice of appropriate feature extraction methods for the audio and visual signals, and the choice of a reliable classification method from the large variety of existing techniques. This paper proposes an Audio-Visual Speech Recognition (AV-ASR) system that uses both audio and visual speech modalities to improve recognition accuracy in clean and noisy environments. The contributions of this paper are twofold. The first is a methodology for choosing the visual features: different feature extraction methods, such as the discrete cosine transform (DCT), blocked DCT, and histograms of oriented gradients combined with local binary patterns (HOG+LBP), are compared, and different dimensionality reduction techniques, such as principal component analysis (PCA), an auto-encoder, linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE), are applied to find the most effective feature vector size. These visual features are then integrated early with audio features obtained as Mel-frequency cepstral coefficients (MFCCs) and fed into the classification process. The second contribution is a methodology for developing the classification process using deep learning, comparing different deep neural network (DNN) architectures, such as bidirectional long short-term memory (BiLSTM) and convolutional neural networks (CNNs), with traditional hidden Markov models (HMMs). The effectiveness of the proposed model is demonstrated on two multi-speaker AV-ASR benchmark datasets, AVletters and GRID, at different SNRs. The experiments are speaker-independent on the AVletters dataset and speaker-dependent on the GRID dataset. The experimental results show that early integration of MFCC audio features with DCT visual features yields higher recognition accuracy with the BiLSTM classifier than the other feature extraction and classification techniques. On GRID, the integrated audio-visual features achieved the highest recognition accuracies of 99.13% and 98.47% for clean and noisy data respectively, an improvement of up to 9.28% and 12.05% over audio-only features. On AVletters, the highest recognition accuracy is 93.33%, an improvement of up to 8.33% over audio-only features. These results improve on previously reported audio-visual recognition accuracies on GRID and AVletters and demonstrate the robustness of our BiLSTM AV-ASR model compared with the CNN and HMM, since the BiLSTM takes into account the sequential characteristics of the speech signal.
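To make the described pipeline concrete, below is a minimal sketch of the feature extraction and early-integration steps, assuming librosa, SciPy, and scikit-learn. The helper names (`audio_features`, `visual_features`, `early_fusion`), the 8x8 DCT block, and the 30-component PCA are illustrative assumptions, not the paper's reported configuration.

```python
# A minimal sketch of MFCC + DCT feature extraction and early integration,
# assuming librosa, SciPy, and scikit-learn; parameter choices are illustrative.
import numpy as np
import librosa
from scipy.fftpack import dct
from sklearn.decomposition import PCA

def audio_features(wav, sr, n_mfcc=13):
    # MFCC sequence over time: shape (T_audio, n_mfcc)
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc).T

def visual_features(mouth_frames, keep=8):
    # 2-D DCT of each grayscale mouth ROI; keep the low-frequency
    # top-left keep x keep block, flattened: shape (T_video, keep*keep)
    feats = []
    for img in mouth_frames:                       # img: (H, W) float array
        coeffs = dct(dct(img.T, norm='ortho').T, norm='ortho')
        feats.append(coeffs[:keep, :keep].ravel())
    return np.asarray(feats)

def early_fusion(mfcc_seq, dct_seq, n_components=30):
    # Reduce the visual vectors with PCA (fit on the training set in a real
    # system), resample the slower video stream to the audio frame rate by
    # nearest-frame indexing, then concatenate the two streams per frame.
    vis = PCA(n_components=n_components).fit_transform(dct_seq)
    idx = np.linspace(0, len(vis) - 1, num=len(mfcc_seq)).astype(int)
    return np.concatenate([mfcc_seq, vis[idx]], axis=1)  # (T_audio, 13+30)
```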
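The fused sequence is then classified. Below is a similarly hedged sketch of the BiLSTM classification stage in Keras, where the 128-unit layer and the training settings are placeholders rather than the authors' architecture.

```python
# A hedged sketch of a BiLSTM sequence classifier in Keras; the layer width
# and training settings are placeholders, not the paper's reported model.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Masking

def build_bilstm(input_dim, n_classes):
    model = Sequential([
        # Ignore zero-padded frames when batching variable-length utterances
        Masking(mask_value=0.0, input_shape=(None, input_dim)),
        # Forward and backward recurrence over the fused feature sequence
        Bidirectional(LSTM(128)),
        Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# e.g. 13 MFCCs + 30 PCA-reduced DCT coefficients, 26 letter classes (AVletters)
model = build_bilstm(input_dim=13 + 30, n_classes=26)
```

Because the recurrence runs over the fused frame sequence in both directions, each prediction can draw on both past and future context, which is the property the abstract credits for the BiLSTM's advantage over the CNN and HMM baselines.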