Journal: Archives of Acoustics

A Study on the Impact of Lombard Effect on Recognition of Hindi Syllabic Units Using CNN Based Multimodal ASR Systems



Abstract

Research on the design of robust multimodal speech recognition systems that use acoustic and visual cues, extracted with relatively noise-robust alternate speech sensors, has recently been gaining interest in the speech processing research community. The primary objective of this work is to study the exclusive influence of the Lombard effect on the automatic recognition of the confusable syllabic consonant-vowel units of the Hindi language, as a step towards building robust multimodal ASR systems for adverse environments in the context of Indian languages, which are syllabic in nature. The dataset for this work comprises the 145 confusable consonant-vowel (CV) syllabic units of the Hindi language, recorded simultaneously using three modalities that capture the acoustic and visual speech cues: a normal acoustic microphone (NM), a throat microphone (TM), and a camera that captures the associated lip movements. The Lombard effect is induced by feeding crowd noise into the speaker's headphones while recording. Convolutional Neural Network (CNN) models are built to categorise the CV units based on their place of articulation (POA), manner of articulation (MOA), and vowels, under both clean and Lombard conditions. For validation, corresponding Hidden Markov Models (HMM) are also built and tested. Unimodal Automatic Speech Recognition (ASR) systems built using each of the three speech cues from Lombard speech show a loss in recognition of MOA and vowels, while POA recognition improves in all the systems due to the Lombard effect. Combining the three complementary speech cues to build bimodal and trimodal ASR systems shows that the recognition loss due to the Lombard effect for MOA and vowels is smaller than in the unimodal systems, while POA recognition is still better due to the Lombard effect. A bimodal system using only the alternate acoustic and visual cues is proposed; it discriminates the place and manner of articulation better than even the standard ASR system. Among the multimodal ASR systems studied, the proposed trimodal system based on Lombard speech gives the best recognition accuracy of 98%, 95%, and 76% for the vowels, MOA, and POA, respectively, with an average improvement of 36% over the unimodal ASR systems and a 9% improvement over the bimodal ASR systems.
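The abstract does not say how the three cues are combined, so the following is only a minimal sketch of one common option, late fusion, in which each unimodal classifier outputs class posteriors and the fused score is a weighted sum of their logarithms. The function names, the equal default weights, the POA class labels, and all probability values are illustrative assumptions, not the authors' method or data.

```python
import math

def fuse_posteriors(posteriors, weights=None):
    """Late fusion (assumed scheme): weighted sum of per-modality
    log-posteriors, renormalised to a probability distribution."""
    names = list(posteriors)
    if weights is None:
        # Assumption: equal weight for every modality.
        weights = {m: 1.0 / len(names) for m in names}
    classes = list(posteriors[names[0]])
    eps = 1e-12  # guards against log(0)
    scores = {
        c: sum(weights[m] * math.log(posteriors[m][c] + eps) for m in names)
        for c in classes
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp_scores.values())
    return {c: v / total for c, v in exp_scores.items()}

def predict(posteriors, weights=None):
    """Return the class with the highest fused posterior."""
    fused = fuse_posteriors(posteriors, weights)
    return max(fused, key=fused.get)

# Hypothetical per-modality posteriors over three POA classes.
example = {
    "NM": {"velar": 0.5, "dental": 0.4, "bilabial": 0.1},
    "TM": {"velar": 0.3, "dental": 0.6, "bilabial": 0.1},
    "visual": {"velar": 0.45, "dental": 0.45, "bilabial": 0.10},
}

trimodal = predict(example)
# Bimodal variant using only the alternate acoustic (TM) and visual cues.
bimodal = predict({m: example[m] for m in ("TM", "visual")})
```

With these made-up numbers the normal microphone alone would favour "velar", while both fused systems pick "dental", which illustrates how a complementary cue can overturn a unimodal decision.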
