International Journal of Advanced Computer Research

Noise robust speech recognition system using multimodal audio-visual approach using different deep learning classification techniques

Abstract

Multimodal speech recognition has proved to be one of the most promising approaches to designing a robust speech recognition system, especially when the audio signal is corrupted by noise. The visual signal can provide additional information that improves recognition accuracy in a noisy system, since the reliability of the visual signal is not affected by acoustic noise. The critical stages in designing a robust speech recognition system are the choice of appropriate feature extraction methods for the audio and visual signals, and the choice of a reliable classification method from the large variety of existing techniques. This paper proposes an Audio-Visual Speech Recognition (AV-ASR) system that uses both audio and visual speech modalities to improve recognition accuracy in clean and noisy environments. The contributions of this paper are twofold. The first is a methodology for choosing the visual features: different feature extraction methods, such as the discrete cosine transform (DCT), blocked DCT, and histograms of oriented gradients combined with local binary patterns (HOG+LBP), are compared, and different dimensionality reduction techniques, such as principal component analysis (PCA), an auto-encoder, linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE), are applied to find the most effective feature vector size. These visual features are then integrated early with audio features obtained as Mel-frequency cepstral coefficients (MFCCs) and fed into the classification process. The second contribution is a methodology for developing the classification process using deep learning, comparing different deep neural network (DNN) architectures, such as bidirectional long short-term memory (BiLSTM) and convolutional neural networks (CNNs), with traditional hidden Markov models (HMMs). The effectiveness of the proposed model is demonstrated on two multi-speaker AV-ASR benchmark datasets, AVletters and GRID, at different SNRs. The experiments are speaker-independent on the AVletters dataset and speaker-dependent on the GRID dataset. The experimental results show that early integration of MFCC audio features with DCT visual features yields higher recognition accuracy with the BiLSTM classifier than the other feature extraction and classification techniques. On GRID, the integrated audio-visual features achieved the highest recognition accuracies of 99.13% and 98.47% for clean and noisy data respectively, an improvement of up to 9.28% and 12.05% over audio-only features. On AVletters, the highest recognition accuracy is 93.33%, an improvement of up to 8.33% over audio-only features. These results improve on previously reported audio-visual recognition accuracies on GRID and AVletters and demonstrate the robustness of our BiLSTM AV-ASR model compared with the CNN and HMM, since the BiLSTM takes into account the sequential characteristics of the speech signal.
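To make the described pipeline concrete, below is a minimal sketch of the feature extraction and early-integration steps, assuming librosa, SciPy, and scikit-learn. The helper names (`audio_features`, `visual_features`, `early_fusion`), the 8x8 DCT block, and the 30-component PCA are illustrative assumptions, not the paper's reported configuration.

```python
# A minimal sketch of MFCC + DCT feature extraction and early integration,
# assuming librosa, SciPy, and scikit-learn; parameter choices are illustrative.
import numpy as np
import librosa
from scipy.fftpack import dct
from sklearn.decomposition import PCA

def audio_features(wav, sr, n_mfcc=13):
    # MFCC sequence over time: shape (T_audio, n_mfcc)
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc).T

def visual_features(mouth_frames, keep=8):
    # 2-D DCT of each grayscale mouth ROI; keep the low-frequency
    # top-left keep x keep block, flattened: shape (T_video, keep*keep)
    feats = []
    for img in mouth_frames:                       # img: (H, W) float array
        coeffs = dct(dct(img.T, norm='ortho').T, norm='ortho')
        feats.append(coeffs[:keep, :keep].ravel())
    return np.asarray(feats)

def early_fusion(mfcc_seq, dct_seq, n_components=30):
    # Reduce the visual vectors with PCA (fit on the training set in a real
    # system), resample the slower video stream to the audio frame rate by
    # nearest-frame indexing, then concatenate the two streams per frame.
    vis = PCA(n_components=n_components).fit_transform(dct_seq)
    idx = np.linspace(0, len(vis) - 1, num=len(mfcc_seq)).astype(int)
    return np.concatenate([mfcc_seq, vis[idx]], axis=1)  # (T_audio, 13+30)
```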
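The fused sequence is then classified. Below is a similarly hedged sketch of the BiLSTM classification stage in Keras, where the 128-unit layer and the training settings are placeholders rather than the authors' architecture.

```python
# A hedged sketch of a BiLSTM sequence classifier in Keras; the layer width
# and training settings are placeholders, not the paper's reported model.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Masking

def build_bilstm(input_dim, n_classes):
    model = Sequential([
        # Ignore zero-padded frames when batching variable-length utterances
        Masking(mask_value=0.0, input_shape=(None, input_dim)),
        # Forward and backward recurrence over the fused feature sequence
        Bidirectional(LSTM(128)),
        Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# e.g. 13 MFCCs + 30 PCA-reduced DCT coefficients, 26 letter classes (AVletters)
model = build_bilstm(input_dim=13 + 30, n_classes=26)
```

Because the recurrence runs over the fused frame sequence in both directions, each prediction can draw on both past and future context, which is the property the abstract credits for the BiLSTM's advantage over the CNN and HMM baselines.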